Stephen J. Barr

csvfix is cool

I have been manipulating some data files in comma separated value format. I have been finding the tool, csvfix to be extremely cool.

For example, I have recently been working with some COMPUSTAT data. The data files are a few hundred megabytes and have several hundred columns, and hundreds of thousands of rows. I wanted all the rows, but only 30 or 40 of the columns, so I had subsetted the large files into much smaller ones. However, I forgot to get the SIC codes. So, rather than open all the files again, I used csvfix.

csvfix order -fn cusip,SIC *.csv | csvfix unique -o quarterly_cusip_SIC.csv

The first command grabbed the cusip and SIC columns from all the .csv files in the current directory. The output of this is redirected to the second command which takes the incoming csv data, separates out just the unique rows, then outputs them to the file quarterly_cusip_SIC.csv.

csvfix seems to be able to do much more than this. The manual looks pretty good – RTFM.


Leverage Ratio’s, Part 1

Using R, it is really easy to make a nice plot of leverage ratios. Using a `data.table` with Compustat annual fundamentals data, you can do the following calculation:

btdt <- transform(btdt, debt =  DLTT + DLC)
btdt <- transform(btdt, book_assets = AT)
btdt <- addVars(btdt, isAnnual = TRUE)
btdt <- transform(btdt, bl2 = debt/book_assets)

Then, you can plot it with `ggplot2`, like this:

x <- ggplot(data=firm.dt, aes(x=date, y=bl2)) + geom_line() + theme_bw()
x <- x + opts(title =  paste(firmname, "Leverage"))
x <- x + ylab("Debt / Total Assets")
print(x)
fname <- paste("plots/", firm, "_bookleverage.png", sep="")
ggsave(filename=fname, plot = x, dpi=600)

This is the result:

http://econsteve.com/wp-content/uploads/2011/12/wpid-CVX_bookleverage.png


Leverage Ratio’s, Part 2

I have been doing quite a bit of research into firm leverage ratios. Therefore, I spent the last few days looking at corporate leverage among a variety of dimensions. The data is entirely from the Compustat database.

Along what dimension?

A very simple first question to ask is, along what dimension am I looking at leverage?

There are actually quite a few possibilities:

  1. Take the average, pooled over all firms and time periods. However, it seems that there is more to say about the entire data set of corporate leverage ratios than can be expressed by one number.
  2. At every distinct point in time, pool all firms. This would create a 1-dimensional time series. This is doing a little better.
  3. For every period, compute means by industry. This creates a panel. However, doing this naievely creates a few problems (see the next section for help). At any given time, some firms enter the sample and some firms are leaving the sample. One way to mitigate this is take a slice of the data set known as a balanced panel, where the firms are chosen such that they exist for the entire window of time of interest. But, this introduces survivorship bias, because the sample is biased towards firms which endure all the economic woes of throughout the timeframe and do not leave the sample.
  4. Look at leverage by leverage decile, year by year. This is an interesting way to see what the bottom, top, and middle leverage levels are doing. It is important to realize that this does not necessarily capture what the firms at each level are doing. This is because firms are free to change their leverage and thus their relative usage of leverage may change over time.
  5. As a response to #4, At time period 1, sort leverage by decile and see which firms are in each decile. Then track these firms over time. This can be very interesting, taking caution that picking different starting time periods will most definitely change the groupings of firms.

There are plenty more.

Show me some results

A plot showing the subsetting as described in #5

http://econsteve.com/wp-content/uploads/2012/01/wpid-firm_booklev_decile_initial_sort.png

Looking at this plot, we see some interesting behaviour. The firms that were at an extremely high leverage ratio (book leverage > 0.6) they tended to delever to some extent, but still settled at a relatively high average when viewed as a group, where it seems like it took about 8 years to stabilize. Firms with relatively low leverage ratios seemed to, as a group, have very stable leverage ratios.

A plot showing the subsetting over time by industry

http://econsteve.com/wp-content/uploads/2012/01/wpid-leverage_time_series_by_industry.png

This plot is a bit harder to read. Construction and retail seem to have delevered, where mining seems to have increased in leverage. Definitely more investigation is needed.

Code

I wrote some code to detect the largest balanced panels in the Compustat database. That code is coming tomorrow. I picked a balanced panel that started in the mid 1980′s for this analysis. I used the data.table package to hold the data, as it allows for keyed data tables that allow extremely fast, in-place subsetting.

For the first plot, the code was as follows.

Subset the data to get just the first time period. Using cut2 from Hmisc library, divide into deciles and label them A..J.

min.date = min(dt.t$date);
dt.tmin = subset(dt.t, date == min.date);
dt.tmin$initialDEC = with(dt.tmin, cut2(book_lev, g=10), labels=1:10);
dt.tmin$initialDEC = factor(dt.tmin$initialDEC, labels = LETTERS[1:10]);

initialDEC is a factor representing the initial decile. The merge does exactly what you think it should, meaning add a column called initialDEC to the the original data set, dt.t, merging on the column cusip.

## prepare to merge
setkey(dt.tmin, cusip);
setkey(dt.t, cusip);

## do the merge
dt.t = dt.t[dt.tmin]

Create a factor called timedec, which is the interaction of the initial decile (initialDEC) and each time period. There should be 10 * (max(dt.t$year) - min(dt.t$year) + 1) distinct values of this factor. Then, set this as they key, and take the means according to this group.

dt.t$timedec = paste(dt.t$date, "--", dt.t$initialDEC, sep="")
setkey(dt.t, timedec)
td.means = dt.t[,list(tdmean=mean(book_lev)), by=timedec]

This unwraps the data. td.means becomes a nice little data.table with 10 * (max(dt.t$year) - min(dt.t$year) + 1) rows.

timedate =  strsplit(as.vector(td.means$timedec), "--")
tddf = as.data.frame(matrix(unlist(timedate), ncol=2, byrow=TRUE))
colnames(tddf) = c("DATE", "iDEC");
td.means$date = tddf$DATE;
td.means$iDEC = tddf$iDEC;
td.means$year = extractyear(td.means$date)

Then, use ggplot2 to do the plotting:

p = ggplot(td.means, aes(x=year, y=tdmean, group=iDEC))
p = p + geom_line(aes(colour=iDEC))
p = p + opts(title = expression("Firm leverage at t=0 decile sort"))
p = p + scale_y_continuous('Book leverage')
print(p)

Coming next

How to use R (and lots of cores) to find all possible balanced panels in a ragged data set.


ess-remote, so useful!

An extremely useful command Emacs command for ESS (Emacs Speaks Statistics) is ess-remote.

Quoting the documentation:

Execute this command from within a buffer running a process. It runs `ess-add-ess-process’ to add the process to `ess-process-name-alist’ and to make it the `ess-current-process-name’. It then prompts the user for an ESS language and sets the editing characteristics appropriately.

Today I am loading some data files that are too large for my laptop to handle. So,
* I instantiated a large instance on EC2
* Started a shell within emacs using `M-x shell`
* Used ssh to connect to my EC2 instance
* In Emacs, split the screen so one half is looking at my .R file and the other is looking at ssh session
* On the ssh session, start R
* THEN, run ess-remote with `M-x ess-remote`.

Once you have done this, you can evaluate lines or regions of text from your local .R script, but have the execution all take place on the EC2 machine.


Eurozone Crisis Links

This article:

* http://www.marketwatch.com/story/imf-rescue-of-italy-will-spark-global-uprising-2011-11-30?Link=obinsite “But whatever it is called, it is not going to work. Why not? Because it puts too much financial strain on the rest of the world. Indeed, if the IMF goes ahead with the plan, it might well finish itself off as a serious custodian of the world’s financial stability.”
* http://www.npr.org/templates/story/story.php?storyId=143015809

Both are good articles. I believe I heard the NPR one while running an errand this evening.

My personal opinion: I am glad that Angela Merkel is not backing the idea of a Eurobond. It seems clear that that scenario would eventually degernate into a borrowing contest. This would be much the same situation as a group of people agreeing beforehand to evenly split a restaurant bill before everyone orders their dinner. In both cases, social pressures aside, it is to the advantage of each party to rack up an above average bill. And, everyone can’t be above average at the same time.


Managing your bibtex

I am working on reviewing a somewhat large body of literature in preparation for a big paper. And, in the process, I decided to make sure to more properly manage my bibtex database. If you are working in LaTeX and don’t know what bibtex is, google and find out. It is the way to manage references for LaTeX papers.

The tool I am currently using for this is http://jabref.sourceforge.net/. It seems to be doing what I want it to do.

Important shortcut keys

  • C-n new entry
  • C-g clean up pasted in bibtex

My workflow then is typically, find paper. Most cites that host papers export BibTex. Get that. C-[TAB] to JabRef, and paste in the BibTex. Then C-g to clean up and automatically set the key name.

Also, if you download a PDF paper, you can drag it into JabRef and it makes the creation of an entry fairly easy. There is plenty more that JabRef does, but so far I have found it useful


Parallel Computing for Econometricians with Amazon Web Services

This past quarter, I had two fantastic class. One was corporate finance with Toni Whited. The other was advanced econometrics with Sanjog Misra. In both classes, I heavily used EC2. The first one I would like to document is a presentation I made for Sanjog’s class. The title of the presentation was “Parallel Computing for Econometricians with Amazon Web Services”, available in PDF.

I really enjoyed the opening:

The presentation documents using EC2 with either Segue or Elastic Map Reduce to demonstrate Monte Carlo simulations and evaluating large likelihood functions using EC2 and R. I have some sample code up here. Usage is described in the presentation. The technique is simulated maximum likelihood.

In the near future, I plan to more neatly package this demonstration and perhaps add a screencast. But, this should get you started.


I’ll admit it. Sometimes, while I am do…

I’ll admit it. Sometimes, while I am doing easy work (e.g. TeX’ing my work), I’ll have a movie running in the background. Usually it is a movie that I am familiar with, so it doesn’t distract me. Right now, I am watching this classic: http://www.imdb.com/title/tt0095705/

In the beginning, O.J. Simpson’s character gets shot. As the 2 detectives are on their way to the hospital, one says to Leslie Nielsen, “he only has a 50/50 chance of living, but there’s only a 10% chance of that”. From a perceptions point of view, would you rather have it phrased this way, or just hear 5%?