21 February 2013

Analysis of Public .Rhistory Files

GitHub recently launched a more powerful search feature, which has already been used more than once to find sensitive files hosted in public GitHub repositories. Used innocently, the same search turns up all sorts of fun things.

Inspired by Aldo Cortesi's post documenting his exploration of public shell history files posted to GitHub, I was curious whether any .Rhistory files had been posted as well. For the uninitiated, .Rhistory files are just logs of the commands entered into the interactive console during an R session. Some recent IDEs, such as RStudio, automatically create these files as you work. These files would typically be excluded from a Git repository, but users could, for whatever reason, choose to include their .Rhistory files in the repository.

Using this search function, combined with the Python script Mr. Cortesi had put together to download the files associated with a GitHub search, I was able to download 638 .Rhistory files from public GitHub repositories (excluding forks). What follows is an exploration of those files.

Load Data

Trimming out the 0-line .Rhistory files leaves 531 non-empty files, containing a total of 157265 commands entered into R.
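As a rough illustration of this loading step, here is a minimal Python sketch. It is not the code used for this post; it assumes the downloaded history files sit somewhere under a single directory, and the function names `load_histories` and `summarize` are made up for the example.

```python
import os

def load_histories(root):
    """Return {path: list of lines} for every non-empty file under root."""
    histories = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # History files may contain odd bytes, so decode leniently.
            with open(path, encoding="utf-8", errors="replace") as fh:
                lines = fh.read().splitlines()
            if lines:  # drop the 0-line files
                histories[path] = lines
    return histories

def summarize(histories):
    """Return (number of files, total number of lines, longest file length)."""
    lengths = [len(lines) for lines in histories.values()]
    return len(lengths), sum(lengths), max(lengths, default=0)
```

Calling `summarize(load_histories("some/directory"))` would give the file count, the command count, and the maximum length in one pass.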

First, I was curious about the length of these files.

 

[Figure: Length of RHistory Files]


It seems that many of these files represent very brief (and likely unpleasant) interactions with R. For instance:

exit
exit
ls
exit

(If you're out there, you were likely looking for the q() command.) Others represent quite extensive projects; the longest file was 7268 lines.

Package Usage

More interesting to me was how these users were using R – what the details contained in these history files represent in terms of the user's interaction with R. For starters, which packages were the users using? We can identify packages loaded via the library() or require() functions.
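A minimal Python sketch of this kind of extraction follows. It is an illustration rather than the code behind the numbers in this post; the regular expression and the name `count_packages` are assumptions. It only sees a package name written as the first argument, so a call like library(package = "foo") would be miscounted.

```python
import re
from collections import Counter

# Matches library(pkg), require("pkg"), library('pkg'), with optional spaces.
# Assumption: package names consist of letters, digits, and dots.
LOAD_CALL = re.compile(
    r"""\b(?:library|require)\s*\(\s*["']?([A-Za-z][A-Za-z0-9.]*)["']?"""
)

def count_packages(lines):
    """Count package names loaded via library() or require() in the lines."""
    counts = Counter()
    for line in lines:
        counts.update(LOAD_CALL.findall(line))
    return counts
```

`count_packages(lines).most_common(10)` then yields a top-10 list of (package, count) pairs.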

There were 3068 such calls to load packages in the scripts. The top 10 packages loaded in this set were:

Package Name  Count
ggplot2         291
plyr             81
GREBase          59
xtable           59
reshape          52
reshape2         48
devtools         43
igraph           41
RGreenplum       40
lattice          39

(Of course, it's worth noting the selection bias that comes from examining only R commands that were included in GitHub projects. I would imagine that the usage of devtools, for instance, is inflated among GitHub projects relative to R users in general.)

Function Use

I was also curious which functions were most widely executed. We can get a rough identification of most function names by looking for a sequence of valid characters followed by an ( symbol.
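One way to express that pattern is sketched below in Python. This is an assumed reading of the approach, not the original code: "valid characters" is taken to mean letters, digits, dots, and underscores, and the name must be immediately followed by the parenthesis. The name `count_function_calls` is hypothetical.

```python
import re
from collections import Counter

# A name starting with a letter or dot, made of letters, digits, dots and
# underscores, immediately followed by "(". The lookbehind stops the match
# from starting in the middle of a longer name.
FUNC_CALL = re.compile(r"(?<![A-Za-z0-9._])([A-Za-z.][A-Za-z0-9._]*)\(")

def count_function_calls(lines):
    """Count every name that appears directly before an opening parenthesis."""
    counts = Counter()
    for line in lines:
        counts.update(FUNC_CALL.findall(line))
    return counts
```

Because the pattern is purely textual, keywords such as for, if, and function are counted whenever they are written directly before a parenthesis, and names inside comments or strings are counted too.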

This gives us a total of 100190 function calls to 8028 unique function names. The 20 most popular functions executed in this set were:

Function Name     Count
source             5191
plot               2552
c                  2448
library            2416
function           1711
for                1138
summary            1107
if                 1062
read.csv            955
rep                 887
length              880
head                828
lm                  766
sum                 753
View                722
print               661
install.packages    648
mean                606
setwd               569
names               562

It should also be possible to identify the functions whose help/manual pages were viewed, by looking for lines beginning with a “?” or for the arguments passed to help().
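A minimal Python sketch of that idea, again as an illustration and not the code used for this post: the two patterns and the name `count_help_requests` are assumptions. Lines starting with "??" (a help search rather than a help page) are skipped, and only the first argument of help() is considered.

```python
import re
from collections import Counter

# "?name" at the start of a line; "??name" does not match this pattern.
HELP_QMARK = re.compile(r"^\s*\?\s*([A-Za-z.][A-Za-z0-9._]*)")
# help(name), help("name") or help('name') anywhere in a line.
HELP_CALL = re.compile(
    r"""\bhelp\s*\(\s*["']?([A-Za-z.][A-Za-z0-9._]*)["']?"""
)

def count_help_requests(lines):
    """Count the topics requested via ?topic or help(topic)."""
    counts = Counter()
    for line in lines:
        match = HELP_QMARK.match(line)
        if match:
            counts[match.group(1)] += 1
        else:
            counts.update(HELP_CALL.findall(line))
    return counts
```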

I can identify 2409 requests for help on 1101 different function names. The 10 functions for which users most often requested help were:

Function Name  Count
plot              43
hist              31
lm                25
writePage         24
order             20
sort              20
cor               17
apply             16
read.csv          16
matrix            15

Conclusion

Of course, there are many other analyses one could perform on this dataset. Post any suggestions you have in the comments; I imagine there's at least one more post of interesting finds in this data. Check out the source code on GitHub.