Analysis of Public .Rhistory Files
GitHub recently launched a more powerful search feature which has been used on more than one occasion to identify sensitive files that may be hosted in a public GitHub repository. When used innocently, there are all sorts of fun things you can find with this search feature.
Inspired by Aldo Cortesi's post documenting his exploration of public shell history files posted to GitHub, I was curious if there were any such .Rhistory
files. For the uninitiated, .Rhistory
files are just logs of commands entered into the interactive console during an R session. Some recent IDEs, such as RStudio, automatically create these files as you work. By default, these files would be excluded from a Git repository, but users could, for whatever reason, choose to include their .Rhistory
files in the repository.
Using this search function, combined with the Python script Mr. Cortesi had put together to download the files associated with a GitHub search, I was able to download 638 .Rhistory
files from public GitHub repositories (excluding forks). What follows is an exploration of those files.
Load Data
Trimming out the 0-line .Rhistory
files leaves us with a total of 531
non-empty files, totaling 157265
commands entered into R.
First, I was curious about the length of these files.
It seems that many of these files represent very brief (and likely unpleasant) interaction with R. For instance:
exit exit ls exit
(if you're out there, you were likely looking for the 'q()
' command). Others represent quite extensive projects; the maximum was 7268
lines long.
Package Usage
More interesting to me was how these users were using R – what the details contained in these history files represent in terms of the user's interaction with R. For starters, which packages were the users using? We can identify packages loaded via the library()
or require()
functions.
There were 3068
such calls to load packages in the scripts. The top 10 packages loaded in this set were:
Package Name | Count |
---|---|
ggplot2 |
291 |
plyr |
81 |
GREBase |
59 |
xtable |
59 |
reshape |
52 |
reshape2 |
48 |
devtools |
43 |
igraph |
41 |
RGreenplum |
40 |
lattice |
39 |
(Of course it's likely worth noting the selection bias from examining only R commands which were included in GitHub projects. I would imagine that the usage for devtools
, for instance, is certainly inflated among GitHub projects over the general populace.)
Function Use
I was also curious which functions were most widely executed. We can get a rough identification of most function names by looking for a sequence of valid characters followed by an (
symbol.
This gives us a total of 100190
function calls of 8028
unique function names. The 20 most popular functions executed in this set were:
Function Name | Count |
---|---|
source |
5191 |
plot |
2552 |
c |
2448 |
library |
2416 |
function |
1711 |
for |
1138 |
summary |
1107 |
if |
1062 |
read.csv |
955 |
rep |
887 |
length |
880 |
head |
828 |
lm |
766 |
sum |
753 |
View |
722 |
print |
661 |
install.packages |
648 |
mean |
606 |
setwd |
569 |
names |
562 |
It should also be possible to identify for which functions the help/manual pages were viewed by identifying lines beginning with a “?” or arguments inside of a call to help()
.
I can identify 2409
requests for help on 1101
different function names. The top 10 most prevalent functions for which users request help follow.
Function Name | Count |
---|---|
plot |
43 |
hist |
31 |
lm |
25 |
writePage |
24 |
order |
20 |
sort |
20 |
cor |
17 |
apply |
16 |
read.csv |
16 |
matrix |
15 |
Conclusion
Of course, there are all sorts of different types of analysis one could perform on this dataset. Post any suggestions you have in the comments; I imagine there's at least one more post of interesting finds in this data. Check out the source code on GitHub.