Introductory Point Pattern Analysis of Open Crime Data in London


(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)
Introduction
Police forces in Britain (http://data.police.uk/) not only record every single crime they encounter, including its coordinates, but also distribute their data freely on the web.
They distribute the data in two ways: the first is through an API, which is extremely easy to use but returns only a limited number of crimes per request; the second is a good old manual download from this page: http://data.police.uk/data/. Again, this page is extremely easy to use; they did a very good job of ensuring that people can access and work with these data. We can just select the time range and the police force from a certain area, and then wait for the system to create the dataset for us. I downloaded data from all forces for May and June 2014 and it took less than 5 minutes to prepare them for download.
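As a side note, the API route can also be queried directly from R; a minimal sketch (the endpoint follows the documentation at http://data.police.uk/docs/, and jsonlite is an extra package not used elsewhere in this post):

# street-level crimes around a point in central London for May 2014
library(jsonlite)
api.crimes <- fromJSON("https://data.police.uk/api/crimes-street/all-crime?lat=51.5074&lng=-0.1278&date=2014-05")
head(api.crimes$category)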
These data are distributed under the Open Government Licence, which allows me to do basically whatever I want with them (even commercially) as long as I cite the origin and the license.


Data Preparation
To complete this experiment we will need the following packages: sp, raster, spatstat, maptools and plotrix.
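For reference, a minimal setup sketch (assuming the packages are already installed; the install line is commented out):

# install.packages(c("sp", "raster", "spatstat", "maptools", "plotrix"))
library(sp)
library(raster)
library(spatstat)
library(maptools)
library(plotrix)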
As I mentioned above, I downloaded all the crime data for May and June 2014 for the whole of Britain. I then decided to focus on the Greater London region, since that is where most crimes are committed and therefore the analysis should be more interesting (while I am writing this part I have not yet finished the whole thing, so I may be wrong). Since the Open Government Licence allows me to distribute the data, I uploaded them to my website so that you can easily replicate this experiment.
The dataset provided by the British Police is in csv format, so to load it we just need to use the read.csv function:

data <- read.csv("http://www.fabioveronesi.net/Blog/2014-05-metropolitan-street.csv")

We can look at the structure of the dataset simply by using the function str:

> str(data)
'data.frame': 79832 obs. of 12 variables:
$ Crime.ID : Factor w/ 55285 levels "","0000782cea7b25267bfc4d22969498040d991059de4ebc40385be66e3ecc3c73",..: 1 1 1 1 1 2926 28741 19664 45219 21769 ...
$ Month : Factor w/ 1 level "2014-05": 1 1 1 1 1 1 1 1 1 1 ...
$ Reported.by : Factor w/ 1 level "Metropolitan Police Service": 1 1 1 1 1 1 1 1 1 1 ...
$ Falls.within : Factor w/ 1 level "Metropolitan Police Service": 1 1 1 1 1 1 1 1 1 1 ...
$ Longitude : num 0.141 0.137 0.14 0.136 0.135 ...
$ Latitude : num 51.6 51.6 51.6 51.6 51.6 ...
$ Location : Factor w/ 20462 levels "No Location",..: 15099 14596 1503 1919 12357 1503 8855 14060 8855 8855 ...
$ LSOA.code : Factor w/ 4864 levels "","E01000002",..: 24 24 24 24 24 24 24 24 24 24 ...
$ LSOA.name : Factor w/ 4864 levels "","Barking and Dagenham 001A",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Crime.type : Factor w/ 14 levels "Anti-social behaviour",..: 1 1 1 1 1 3 3 5 7 7 ...
$ Last.outcome.category: Factor w/ 23 levels "","Awaiting court outcome",..: 1 1 1 1 1 21 8 21 8 8 ...
$ Context : logi NA NA NA NA NA NA ...

This dataset provides several useful pieces of information about each crime: its location (longitude and latitude in degrees), the address (if available), the type of crime and the court outcome (if available). For the purpose of this experiment we only need to look at the coordinates and the type of crime.
For some incidents the coordinates are not provided, so before we can proceed we need to remove the NAs from data:

data <- data[!is.na(data$Longitude)&!is.na(data$Latitude),]

This eliminates 870 entries from the file, so data now has 78,962 rows.
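A quick sanity check of what is left (a sketch):

nrow(data)                                    # 78,962 records remain
anyNA(data$Longitude) | anyNA(data$Latitude)  # should now be FALSE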


Point Pattern Analysis
A point process is a stochastic process whose results, or events, we observe only in a specific region, i.e. the area under study, or simply the window. The location of the events is a point pattern (Bivand et al., 2008).
In R the package for point pattern analysis is spatstat, which works with its own format (i.e. ppp). There are ways to transform a data.frame into a ppp object; however, in this case we have a problem. The crime dataset contains lots of duplicated locations. We can check this by first transforming data into a Spatial object and then using the function zerodist to check for duplicated locations:

> coordinates(data)=~Longitude+Latitude
> zero <- zerodist(data)
> length(unique(zero[,1]))
[1] 47920

If we check the number of duplicates we see that more than half of the reported crimes are duplicated somehow. I checked some individual cases to see if I could spot a pattern, but I could not. Sometimes we have duplicates of the same crime, probably because more than one person was involved; in other cases we have two different crimes at the same location, maybe because the crime belongs to several categories. Whatever the case, the presence of duplicates creates a problem, because the package spatstat does not allow them. In R the function remove.duplicates is able to get rid of duplicates; however, in this case I am not sure we can use it, because we would be removing crimes for which we do not have enough information to assess whether they can in fact be removed.

So we need to find a way to work around the problem.
This sort of problem is often encountered when working with real datasets, but it is mostly not covered in textbooks; only experience and common sense help us in these situations.
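For instance, a quick way to eyeball one of the duplicated pairs flagged by zerodist (a sketch based on the zero object created above):

# compare the attributes of the first pair of points sharing the same coordinates
pair <- zero[1,]
data@data[pair, c("Crime.type", "Location", "LSOA.name")]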

There is also another potential issue with this dataset. Even though the large majority of crimes are reported for London, some of them (n=660) are located in other areas. Since these crimes are a small fraction of the total, I do not think it makes much sense to include them in the analysis, so we need to remove them. To do so we need to import a shapefile with the borders of the Greater London region. Natural Earth provides this sort of data, since it distributes shapefiles at various resolutions. For this analysis we need the following dataset: Admin 1 – States, Provinces.

To download it and import it in R we can use the following lines:

download.file("http://www.naturalearthdata.com/http//www.naturalearthdata.com/download/10m/cultural/ne_10m_admin_1_states_provinces.zip",destfile="ne_10m_admin_1_states_provinces.zip")
unzip("ne_10m_admin_1_states_provinces.zip",exdir="NaturalEarth")
border <- shapefile("NaturalEarth/ne_10m_admin_1_states_provinces.shp")

These lines download the shapefile as a compressed archive (.zip), uncompress it into a new folder named NaturalEarth in the working directory, and then load it.

To extract only the border of the Greater London region we can simply subset the SpatialPolygons object as follows:

GreaterLondon <- border[paste(border$region)=="Greater London",]

Now we need to overlay it with crime data and then eliminate all the points that do not belong to the Greater London region. To do that we can use the following code:

projection(data)=projection(border)
overlay <- over(data,GreaterLondon)
 
data$over <- overlay$OBJECTID_1
 
data.London <- data[!is.na(data$over),]

The first line assigns to the object data the same projection as the object border; we can do this safely because we know that the crime dataset is in geographical coordinates (WGS84), the same as border.
Then we can use the function over to overlay the two objects. At this point we need a way to extract from data only the points that belong to the Greater London region. To do that we can create a new column and assign to it the values of the overlay object (which column of the overlay object we use does not really matter, since we only need it to identify locations where the overlay returned some data). In locations where the data are outside the area defined by border the new column will have values of NA, so we can use this information to extract the locations we need with the last line.
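As an aside, sp also offers a more concise spatial subset that should give the same result (a sketch; data.London.alt is just an illustrative name):

# subset the points directly by the polygon object
data.London.alt <- data[GreaterLondon,]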

We can create a very simple plot of the final dataset and save it in a jpeg using the following code:

jpeg("PP_plot.jpg",2500,2000,res=300)
plot(data.London,pch="+",cex=0.5,main="",col=data.London$Crime.type)
plot(GreaterLondon,add=T)
legend(x=-0.53,y=51.41,pch="+",col=unique(data.London$Crime.type),legend=unique(data.London$Crime.type),cex=0.4)
dev.off()

This creates the image below:


Now that we have a dataset of crimes only for Greater London we can start our analysis.


Descriptive Statistics
The focus of a point pattern analysis is firstly to examine the spatial distribution of the events, and secondly to make inferences about the process that generated the point pattern. Thus the first step in every point pattern analysis, as in every statistical and geostatistical analysis, is to describe the dataset at hand with some descriptive indexes. In statistics we normally use the mean and standard deviation to achieve this; however, here we are working in 2D space, so things are slightly more complicated. For example, instead of computing the mean we compute the mean centre, which is basically the point identified by the mean value of longitude and the mean value of latitude:
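Reconstructed from the description above (and consistent with the code further below), the mean centre is:

\bar{x}_{mc} = \frac{1}{n}\sum_{i=1}^{n} x_i , \qquad \bar{y}_{mc} = \frac{1}{n}\sum_{i=1}^{n} y_i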


Using the same principle we can compute the standard deviation of longitude and latitude, and the standard distance, which measures the standard deviation of the distance of each point from the mean centre. This is important because it gives a measure of spread in the 2D space, and can be computed with the following equation from Wu (2006):
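Reconstructed from the code below, the standard distance of Wu (2006) is:

SD = \sqrt{ \frac{ \sum_{i=1}^{n} \left[ (x_i - \bar{x}_{mc})^2 + (y_i - \bar{y}_{mc})^2 \right] }{ n } }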


In R we can calculate all these indexes with the following simple code:

mean_centerX <- mean(data.London@coords[,1])
mean_centerY <- mean(data.London@coords[,2])
 
standard_deviationX <- sd(data.London@coords[,1])
standard_deviationY <- sd(data.London@coords[,2])
 
standard_distance <- sqrt(sum(((data.London@coords[,1]-mean_centerX)^2+(data.London@coords[,2]-mean_centerY)^2))/(nrow(data.London)))

We can use the standard distance to get a visual feel for the spread of our data around their mean centre. We can use the function draw.circle in the package plotrix to do that:

jpeg("PP_Circle.jpeg",2500,2000,res=300)
plot(data.London,pch="+",cex=0.5,main="")
plot(GreaterLondon,add=T)
points(mean_centerX,mean_centerY,col="red",pch=16)
draw.circle(mean_centerX,mean_centerY,radius=standard_distance,border="red",lwd=2)
dev.off()

which returns the following image:


The problem with the standard distance is that it averages the standard deviation of the distances over both coordinates, so it does not take into account possible differences between the two dimensions. We can take those into account by plotting an ellipse, instead of a circle, with the two axes equal to the standard deviations of longitude and latitude. We can again use the package plotrix, but this time with the function draw.ellipse, to do the job:

jpeg("PP_Ellipse.jpeg",2500,2000,res=300)
plot(data.London,pch="+",cex=0.5,main="")
plot(GreaterLondon,add=T)
points(mean_centerX,mean_centerY,col="red",pch=16)
draw.ellipse(mean_centerX,mean_centerY,a=standard_deviationX,b=standard_deviationY,border="red",lwd=2)
dev.off()

This returns the following image:




Working with spatstat
Let's now look at the details of the package spatstat. As I mentioned, we cannot use it if we have duplicated points, so we first need to eliminate them. In my opinion we cannot just remove them across the board, because we are not sure about their cause. However, we can subset the dataset by type of crime and then remove duplicates within it. In that case the duplicated points are most probably multiple individuals caught for the same crime, and if we delete those it will not change the results of the analysis.
I decided to focus on drug related crimes, since they are not as common as other types and therefore I can better present the steps of the analysis. We can subset the data and remove duplicates as follows:

Drugs <- data.London[data.London$Crime.type==unique(data.London$Crime.type)[3],]
Drugs <- remove.duplicates(Drugs)

We obtain a dataset with 2745 events spread all over Greater London.
A point pattern is defined as a series of events in a given area, or window, of observation. It is therefore extremely important to precisely define this window. In spatstat the function owin is used to set the observation window. However, the standard function takes the coordinates of a rectangle or of a polygon from a matrix, and therefore it may be a bit tricky to use. Luckily the package maptools provides a way to transform a SpatialPolygons into an object of class owin, using the function as.owin (Note: a function with the same name is also available in spatstat but it does not work with SpatialPolygons, so be sure to load maptools):

window <- as.owin(GreaterLondon)

Now we can use the function ppp, in spatstat, to create the point pattern object:

Drugs.ppp <- ppp(x=Drugs@coords[,1],y=Drugs@coords[,2],window=window)


Intensity and Density
A crucial piece of information we need when we deal with point patterns is a quantitative description of the spatial distribution, i.e. how many events we have in a predefined window. The index that defines this is the intensity, which is the average number of events per unit area.
In this example we cannot calculate the intensity straight away, because we are dealing with degrees, and therefore we would end up dividing the number of crimes (n=2745) by the total area of Greater London, which in degrees is about 0.2. It would make much more sense to transform all of our data into UTM and then calculate the number of crimes per square meter. We can transform any spatial object into a different coordinate system using the function spTransform, in the package sp:

GreaterLondonUTM <- spTransform(GreaterLondon,CRS("+init=epsg:32630"))

We just need to define the CRS of the new coordinate system, which can be found here: http://spatialreference.org/
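Note that the per-borough windows used further below are taken from GreaterLondonUTM, so the drug-crime points may need the same transformation; a sketch (DrugsUTM is a name introduced here only for illustration):

# transform the crime points to the same UTM zone as the borough polygons
DrugsUTM <- spTransform(Drugs, CRS("+init=epsg:32630"))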

Now we can compute the intensity as follows:

Drugs.ppp$n/sum(sapply(slot(GreaterLondonUTM, "polygons"), slot, "area"))

The numerator is the number of points in the ppp object, while the denominator is the sum of the areas of all the polygons (this function was copied from here: r-sig-geo). For drug related crimes the average intensity is 1.71x10^-6 per square meter over the Greater London area.
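For what it's worth, the raster package (already loaded here) has an area helper that should give the same denominator for projected polygons (a sketch):

# total area of Greater London in square meters, from the projected polygons
Drugs.ppp$n/sum(area(GreaterLondonUTM))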

The intensity may be constant across the study window; in that case in every square meter we would find the same number of points, and the process would be uniform or homogeneous. Most often the intensity is not constant and varies spatially throughout the study window; in that case the process is inhomogeneous. For inhomogeneous processes we need a way to determine the amount of spatial variation of the intensity. There are several ways of dealing with this problem; one example is quadrat counting, where the area is divided into rectangles and the number of events in each of them is counted:

jpeg("PP_QuadratCounting.jpeg",2500,2000,res=300)
plot(Drugs.ppp,pch="+",cex=0.5,main="Drugs")
plot(quadratcount(Drugs.ppp, nx = 4, ny = 4),add=T,col="blue")
dev.off()

which divides the area into a 4 by 4 grid of rectangles (clipped to the window) and then counts the number of events in each of them:


This function is good for certain datasets, but in this case it does not really make sense to use quadrat counting, since the areas it creates do not have any meaning in reality. It would be far more valuable to extract the number of crimes by borough, for example. To do this we need to use a loop and iterate through the polygons:

Local.Intensity <- data.frame(Borough=factor(),Number=numeric())
for(i in unique(GreaterLondonUTM$name)){
  sub.pol <- GreaterLondonUTM[GreaterLondonUTM$name==i,]
 
  sub.ppp <- ppp(x=Drugs.ppp$x,y=Drugs.ppp$y,window=as.owin(sub.pol))
  Local.Intensity <- rbind(Local.Intensity,data.frame(Borough=factor(i,levels=GreaterLondonUTM$name),Number=sub.ppp$n))
}

We can take a look at the results in a barplot with the following code:

colorScale <- color.scale(Local.Intensity[order(Local.Intensity[,2]),2],color.spec="rgb",extremes=c("green","red"),alpha=0.8)
 
jpeg("PP_BoroughCounting.jpeg",2000,2000,res=300)
par(mar=c(5,13,4,2))
barplot(Local.Intensity[order(Local.Intensity[,2]),2],names.arg=Local.Intensity[order(Local.Intensity[,2]),1],horiz=T,las=2,space=1,col=colorScale)
dev.off()

which returns the image below:



Another way in which we can determine the spatial distribution of the intensity is by using kernel smoothing (Diggle, 1985; Berman and Diggle, 1989; Bivand et al., 2008). This method computes the intensity continuously across the study area. To perform this analysis in R we need to define the bandwidth of the density estimation, which basically determines the area of influence of the estimation. There is no general rule for determining the correct bandwidth; generally speaking, if h is too small the estimate is too noisy, while if h is too high the estimate may miss crucial elements of the point pattern due to oversmoothing (Scott, 2009). In spatstat the functions bw.diggle, bw.ppl and bw.scott can be used to estimate the bandwidth according to different methods. We can test how they work with our dataset using the following code:

jpeg("Kernel_Density.jpeg",2500,2000,res=300)
par(mfrow=c(2,2))
plot(density.ppp(Drugs.ppp, sigma = bw.diggle(Drugs.ppp),edge=T),main=paste("h =",round(bw.diggle(Drugs.ppp),2)))
plot(density.ppp(Drugs.ppp, sigma = bw.ppl(Drugs.ppp),edge=T),main=paste("h =",round(bw.ppl(Drugs.ppp),2)))
plot(density.ppp(Drugs.ppp, sigma = bw.scott(Drugs.ppp)[2],edge=T),main=paste("h =",round(bw.scott(Drugs.ppp)[2],2)))
plot(density.ppp(Drugs.ppp, sigma = bw.scott(Drugs.ppp)[1],edge=T),main=paste("h =",round(bw.scott(Drugs.ppp)[1],2)))
dev.off()

which generates the following image, from which it is clear that every method works very differently:


As you can see, a low value of the bandwidth produces a very detailed plot, while increasing this value creates a very smooth surface where the local details are lost. This is basically a heat map of the crimes in London, so we need to be very careful in choosing the right bandwidth, since these images, if shown alone, may have a very different impact, particularly on people not familiar with the matter. The first image may create the illusion that the crimes are clustered in very small areas, while the last may give the opposite impression.


Complete spatial randomness
Assessing whether a point pattern is random is a crucial step of the analysis. If we determine that the pattern is random it means that each point is independent of every other point and of any other factor. Complete spatial randomness implies that events from the point process are equally likely to occur in every region of the study window. In other words, the location of one point does not affect the probability of another being observed nearby; each point is therefore completely independent of the others (Bivand et al., 2008).
If a point pattern is not random it can be classified in two other ways: clustered or regular. Clustered means that there are areas where the number of events is higher than average; regular means that basically each subarea has the same number of events. Below is an image that should better explain the differences between these distributions:
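To get a feel for these three situations, they can also be simulated directly in spatstat (a sketch, independent of the image below; the parameters are arbitrary):

# simulate a random, a clustered and a regular pattern in a unit square
par(mfrow=c(1,3))
plot(rpoispp(100), main="Random (CSR)")          # homogeneous Poisson process
plot(rMatClust(10, 0.05, 10), main="Clustered")  # Matern cluster process
plot(rSSI(0.07, 100), main="Regular")            # simple sequential inhibition
par(mfrow=c(1,1))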



In spatstat we can determine which distribution our data follow using the G function, which computes the distribution of the distances between each event and its nearest neighbour (Bivand et al., 2008). Based on the curve generated by the G function we can determine the distribution of our data. I will not explain here the details of how the G function is computed and its precise meaning; for that you need to look at the references. However, just by looking at the plots we can easily determine the distribution of our data.
Let's take a look at the image below to clarify things:


These are the curves generated by the G function for each distribution. The blue line is the G function computed for a completely spatially random point pattern; in the first case, since the data more or less follow the blue line, the process is random. In the second case the line calculated from the data lies above the blue line, which indicates a clustered distribution. Conversely, if the line generated from the data lies below the blue line, the point pattern is regular.
We can compute and plot this function for our data simply using the following lines:

jpeg("GFunction.jpeg",2500,2000,res=300)
plot(Gest(Drugs.ppp),main="Drug Related Crimes")
dev.off()

which generates the following image:



From this image it is clear that the process is clustered. We could have deduced this by looking at the previous plots, since it is clear that there are areas where more crimes are committed; however, with this method we have a quantitative way of supporting our hypothesis.
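For a more formal check, spatstat can also wrap the G function in Monte Carlo simulation envelopes (a sketch; the number of simulations is arbitrary):

# envelope of the G function under complete spatial randomness
G.env <- envelope(Drugs.ppp, fun=Gest, nsim=99)
plot(G.env, main="Drug Related Crimes - G function with CSR envelope")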



Conclusion
In this experiment we performed some basic point pattern analysis on open crime data. The only conclusion we reached is that the data are clearly clustered in certain areas and boroughs. However, at this point we are not able to determine the origin and the causes of these clusters.



References
Bivand, R. S., Pebesma, E. J., & Gómez-Rubio, V. (2008). Applied Spatial Data Analysis with R. New York: Springer.

Wu, C. (2006). Intermediate Geographic Information Science – Point Pattern Analysis. Department of Geography, The University of Wisconsin-Milwaukee. http://uwm.edu/Course/416-625/week4_point_pattern.ppt - Last accessed: 28.01.2015

Berman, M. and Diggle, P. J. (1989). Estimating weighted integrals of the second-order intensity of a spatial point process. Journal of the Royal Statistical Society B, 51:81–92.

Diggle, P. J. (1985). A kernel method for smoothing point process data. Applied Statistics, 34:138–147.

Scott, D. W. (2009). Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons.







sjmisc – package for working with (labelled) data #rstats


(This article was first published on Strenge Jacke! » R, and kindly contributed to R-bloggers)

The sjmisc-package

My last posting was about reading and writing data between R and other statistical packages like SPSS, Stata or SAS. After that, I decided to bundle all functions that are not directly related to plotting or printing tables into a new package called sjmisc.

Basically, this package covers three domains of functionality:

  • reading and writing data between other statistical packages (like SPSS) and R, based on the haven and foreign packages; hence, sjmisc also includes functions for working with labelled data.
  • frequently used statistical tests, or at least convenient wrappers for such test functions
  • frequently applied recoding and variable conversion tasks

In this posting, I want to give a quick and short introduction into the labeling features.

Labelled Data

In software like SPSS, it is common to have value and variable labels as variable attributes. Variable values, even if categorical, are mostly numeric. In R, however, you may use labels as values directly:

> factor(c("low", "high", "mid", "high", "low"))
[1] low  high mid  high low 
Levels: high low mid

Reading SPSS data (with haven, foreign or sjmisc) keeps the numeric values for variables and adds the value and variable labels as attributes. See the following example from the sample dataset efc, which is part of the sjmisc-package:

library(sjmisc)
data(efc)
str(efc$e42dep)

> atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
> - attr(*, "label")= chr "how dependent is the elder? - subjective perception of carer"
> - attr(*, "labels")= Named num [1:4] 1 2 3 4
>  ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"

While all plotting and table functions of the sjPlot-package make use of these attributes (see many examples here), many packages and/or functions do not consider these attributes, e.g. R base graphics:

library(sjmisc)
data(efc)
barplot(table(efc$e42dep, efc$e16sex), 
        beside = T, 
        legend.text = T)

barplot_1

Adding value labels as factor values

to_label is a sjmisc-function that converts a numeric variable into a factor and sets the attribute value labels as factor levels. Using factors with labelled levels, the bar plot is now labelled.

library(sjmisc)
data(efc)
barplot(table(to_label(efc$e42dep),
              to_label(efc$e16sex)), 
        beside = T, 
        legend.text = T)

Rplot

to_fac is a convenient replacement for as.factor, which converts a numeric vector into a factor but keeps the value and variable label attributes.
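A minimal sketch of what this looks like (output omitted here):

library(sjmisc)
data(efc)
# unlike as.factor, to_fac keeps the label attributes attached to the vector
e42dep.fac <- to_fac(efc$e42dep)
str(e42dep.fac)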

Getting and setting value and variable labels

There are four functions that let you easily set or get value and variable labels of either a single vector or a complete data frame:

  • get_var_labels() to get variable labels
  • get_val_labels() to get value labels
  • set_var_labels() to set variable labels (add them as vector attribute)
  • set_val_labels() to set value labels (add them as vector attribute)
library(sjmisc)
data(efc)
barplot(table(to_label(efc$e42dep),
              to_label(efc$e16sex)), 
        beside = T, 
        legend.text = T,
        main = get_var_labels(efc$e42dep))

Rplot01

get_var_labels(efc) would return all of the data frame’s variable labels, and get_val_labels(efc) would return a list with the value labels of all the data frame’s variables.
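A small sketch of the setter functions on a made-up vector (the exact format of the labels argument is an assumption here, so check the package documentation):

library(sjmisc)
x <- c(1, 2, 2, 3, 1)                            # hypothetical, unlabelled vector
x <- set_var_labels(x, "hypothetical rating")    # attach a variable label
x <- set_val_labels(x, c("low", "mid", "high"))  # attach value labels for 1, 2, 3
get_var_labels(x)
get_val_labels(x)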

Restore labels from subsetted data

The base subset function, as well as dplyr’s (at least up to 0.4.1) filter and select functions, omits label attributes (or vector attributes in general) when subsetting data. In the current development snapshot of sjmisc on GitHub (which will most likely become version 1.0.3, to be released in June or July), there are handy functions to deal with this problem: add_labels and remove_labels.

add_labels adds labels back to a subsetted data frame, based on the original data frame, and remove_labels removes all label attributes (this might be necessary when working with dplyr up to 0.4.1, since dplyr sometimes throws an error when working with labelled data – this issue should be addressed in the next dplyr update).

Losing labels during subset

library(sjmisc)
data(efc)
efc.sub <- subset(efc, subset = e16sex == 1, select = c(4:8))
str(efc.sub)

> 'data.frame':	296 obs. of  5 variables:
> $ e17age : num  74 68 80 72 94 79 67 80 76 88 ...
> $ e42dep : num  4 4 1 3 3 4 3 4 2 4 ...
> $ c82cop1: num  4 3 3 4 3 3 4 2 2 3 ...
> $ c83cop2: num  2 4 2 2 2 2 1 3 2 2 ...
> $ c84cop3: num  4 4 1 1 1 4 2 4 2 4 ...

Add back labels

efc.sub <- add_labels(efc.sub, efc)
str(efc.sub)

> 'data.frame':	296 obs. of  5 variables:
>  $ e17age : atomic  74 68 80 72 94 79 67 80 76 88 ...
>   ..- attr(*, "label")= Named chr "elder' age"
>   .. ..- attr(*, "names")= chr "e17age"
>  $ e42dep : atomic  4 4 1 3 3 4 3 4 2 4 ...
>   ..- attr(*, "label")= Named chr "how dependent is the elder? - subjective perception of carer"
>   .. ..- attr(*, "names")= chr "e42dep"
>   ..- attr(*, "labels")= Named chr  "1" "2" "3" "4"
>   .. ..- attr(*, "names")= chr  "independent" "slightly dependent" "moderately dependent" "severely dependent"

# truncated output

So, when working with labelled data, especially with data sets imported from other software packages, it comes in very handy to make use of the label attributes. The sjmisc package supports this feature and offers some useful functions for these tasks…




Review of ‘Advanced R’ by Hadley Wickham


(This article was first published on Burns Statistics » R language, and kindly contributed to R-bloggers)

Executive summary

Surprisingly good.

And it’s not like my expectations were especially low.

Structure

There are 20 chapters.  I mostly like the chapters and their order.

Hadley breaks the 20 chapters into 4 parts.  He’s wrong.  Figure 1 illustrates the correct way to formulate parts.

Figure 1: Chapters and Parts of Advanced R.

 

Introductory R

There are by now lots of introductions to R.  There’s even an R for Dummies.

The introduction here is clean and serviceable.  If you are an experienced programmer learning R, these chapters will provide you with most of the basics.

Interlude

Chapter 5 on code style is a bit of fluff that insulates the introductory material from the more advanced part.  I hope most people can just about cope with there being an extra space or new line compared to their preference.

However, the chapter does talk about one thing that I think is very important: consistent naming style.  Consistent naming conserves a lot of energy for users — it allows them to think about what they are doing rather than trying to remember trivia.  I believe my personal best for continuous consistent naming in R is 4.3 hours of coding time.  In R it’s hard.

Language R

In which the balance of the universe is partially restored.

There are lots of places that talk about what a chaotic mess R is.  A book-length dose of venom is The R Inferno.

These chapters, in contrast, show the elegance, the flexibility, the power of the R language.  This is the mesmerizing part of the book.

Both views are valid: the mess is on the surface, the beauty is deeper.

Working with R

Useful.  But it includes the word “work” so it can’t be all good.

Favorite sentences

R doesn’t protect you from yourself: you can easily shoot yourself in the foot.  As long as you don’t aim the gun at your foot and pull the trigger, you won’t have a problem.

On speed:

R was purposely designed to make data analysis and statistics easier for you to do.  It was not designed to make life easier for your computer.  While R is slow compared to other programming languages, for most purposes, it’s fast enough.

On object orientation:

S3 is informal and ad hoc, but it has a certain elegance in its minimalism: you can’t take away any part of it and still have a useful OO system.

 

Appendix R

Here is the code that created Figure 1:

P.advanced_R_table_of_contents <- 
 function (filename = "advanced_R_table_of_contents.png", seed=18) 
{
  if(length(filename)) {
    png(file=filename, width=512, height=700)
    par(mar=rep(0,4)+.1)
  }
  advR.chap <- c('Introduction', 'Data structures', 'Subsetting', 'Vocabulary',
                 'Style guide', 'Functions', 'OO field guide', 'Environments', 
                 'Debugging, condition handling, and defensive programming', 
                 'Functional programming', 'Functionals', 'Function operators',
                 'Non-standard evaluation', 'Expressions', 'Domain specific languages',
                 'Performance', 'Optimising code', 'Memory', 
                 'High performance functions with Rcpp', "R's C interface")
  advR.chap[9] <- "Debugging, [...]"
  plot.new()
  plot.window(xlim=c(0,4), ylim=c(21,0))
  hl <- 1.7
  pl <- 2.3
 
  if(length(seed) && !is.na(seed)) set.seed(seed)

  text(c(.5, 2, 3.5), .5, c("Hadley part", "Chapter", "Pat part"), font=2)
  rect(0, 1.5, hl, 9.5, col=do.call('rgb', as.list(runif(3, .8, 1))), border=NA)
  rect(0, 9.5, hl, 12.5, col=do.call('rgb', as.list(runif(3, .8, 1))), border=NA)
  rect(0, 12.5, hl, 15.5, col=do.call('rgb', as.list(runif(3, .8, 1))), border=NA)
  rect(0, 15.5, hl, 20.5, col=do.call('rgb', as.list(runif(3, .8, 1))), border=NA)
 
  rect(pl, 1.5, 4, 4.5, col=do.call('rgb', as.list(runif(3, .8, 1))), border=NA)
  rect(pl, 5.5, 4, 14.5, col=do.call('rgb', as.list(runif(3, .8, 1))), border=NA)
  rect(pl, 14.5, 4, 20.5, col=do.call('rgb', as.list(runif(3, .8, 1))), border=NA)
 
  text(2, 1:20, advR.chap)
  text(hl/2, 5.5, "Foundations")
  text(hl/2, 11, "Functional\nprogramming")
  text(hl/2, 14, "Computing\non the\nlanguage")
  text(hl/2, 18, "Performance")
  text((4+pl)/2, 3, "Introductory\nR")
  text((4+pl)/2, 10, "Language\nR")
  text((4+pl)/2, 17.5, "Working with\nR")
 
  if(length(filename)) {
    dev.off()
  }
}

The part that may be of most interest is that the colors are randomly generated.  Not all colors are allowed — they can only go so dark.


Geomorph update 2.1.5 Now Available!


(This article was first published on geomorph, and kindly contributed to R-bloggers)
Geomorph users,

We have uploaded version 2.1.5 of geomorph* to CRAN. The Windows and Mac binaries have been compiled and the tarball is available.

New Features:

  • New Auto Mode allows users to include pre-digitized landmarks when using build.template() and digitsurface()

  • gridPar() is a new function to customize plots produced by plotRefToTarget()

  • digit.curves() is a new function to calculate equidistant semilandmarks along 2D and 3D curves (based on the tpsDIG algorithm for 2D curves)

  • define.sliders() is a new interactive function for defining sliding semilandmarks for 2D and 3D curves, plus an automatic mode when given a sequence of semilandmarks along a curve

  • plotGMPhyloMorphoSpace() now has options to customise the plots

Important Bug Fixes:

  • Corrected an error in plotAllometry() where verbose=T did not return

Other Changes:

  • pairwiseD.test() and pairwise.slope.test() deprecated and replaced by advanced.procD.lm()

  • Read functions now allow both tab and space delimited files

  • define.sliders.2d() and define.sliders.3d() deprecated and replaced by define.sliders()

Emma

* geomorph: Geometric Morphometric Analyses of 2D/3D Landmark Data

Read, manipulate, and digitize landmark data, generate shape variables via Procrustes analysis for points, curves and surfaces, perform shape analyses, and provide graphical depictions of shapes and patterns of shape variation.

Interactive maps of Crime data in Greater London


(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)
In the previous post we looked at ways to perform some introductory point pattern analysis of open data downloaded from Police.uk. As you may remember, we subset the dataset of crimes in the Greater London area, extracting only the drug related ones. Subsequently, we looked at ways to use those data with the package spatstat and perform some basic statistics.
In this post I will briefly discuss ways to create interactive plots of the results of the point pattern analysis, using the Google Maps API and Leaflet from R.

Number of Crimes by Borough
In the previous post we looped through the GreaterLondonUTM shapefile to extract the area of each borough and then counted the number of crimes within its border. To show the results we used a simple barplot. Here I would like to use the same method I presented in my post Interactive Maps for the Web to plot these results on Google Maps.

This post is intended to be a continuation of the previous, so I will not present again the methods and objects we used in the previous experiment. To make this code work you can just copy and paste it below the code you created before and it should work just fine.

First of all, let's create a new object including only the names of the boroughs from the GreaterLondonUTM shapefile. We need to do this because otherwise, when we click on a polygon on the map, it will show us a long list of useless data.

GreaterLondon.Google <- GreaterLondonUTM[,"name"]

The new object has only one column with the name of each borough.
Now we can create a loop to iterate through these names and calculate the intensity of the crimes:

Borough <- GreaterLondonUTM[,"name"]
 
for(i in unique(GreaterLondonUTM$name)){
  sub.name <- Local.Intensity[Local.Intensity[,1]==i,2]
 
  Borough[Borough$name==i,"Intensity"] <- sub.name
 
  Borough[Borough$name==i,"Intensity.Area"] <- round(sub.name/(GreaterLondonUTM[GreaterLondonUTM$name==i,]@polygons[[1]]@area/10000),4)
}

As you can see, this loop selects one name at a time, then subsets the object Local.Intensity (which we created in the previous post) to extract the number of crimes for each borough. The next line attaches this intensity to the object Borough as a new column named Intensity. However, the code does not stop here. We also create another column, named Intensity.Area, in which we calculate the number of crimes per unit area. Since the area from the shapefile is in square meters and the numbers were very high, I divided it by 10,000, so this column shows the number of crimes per 10,000 square meters (one hectare) in each borough. This should correct for the fact that certain boroughs have a relatively high number of crimes only because their area is larger than others.

Now we can again use the package plotGoogleMaps to create a beautiful visualization of our results and save it as HTML, so that we can upload it to our website or blog.
The code for doing that is very simple and is presented below:

plotGoogleMaps(Borough,zcol="Intensity",filename="Crimes_Boroughs.html",layerName="Number of Crimes", fillOpacity=0.4,strokeWeight=0,mapTypeId="ROADMAP")

I decided to plot the polygons on top of the roadmap and not on top of the satellite image, which is the default for the function. Thus I added the option mapTypeId="ROADMAP".
The result is the map shown below and at this link: Crimes on GoogleMaps



In the post Interactive Maps for the Web in R I received a comment from Gerardo Celis, whom I thank, telling me that the package leafletR is now available in R, which allows us to create interactive maps based on Leaflet. So for this new experiment I decided to try it out!

I started from the sample code presented here: https://github.com/chgrl/leafletR and adapted it to my data with very few changes.
The function leaflet does not work directly with Spatial data; we first need to transform them into GeoJSON with another function in leafletR:

Borough.Leaflet <- toGeoJSON(Borough)

Extremely simple!!

Now we need to set the style to use for plotting the polygons using the function styleGrad, which is used to create a list of colors based on a particular attribute:

map.style <- styleGrad(pro="Intensity",breaks=seq(min(Borough$Intensity),max(Borough$Intensity)+15,by=20),style.val=cm.colors(10),leg="Number of Crimes", fill.alpha=0.4, lwd=0)

In this function we need to set several options:
pro = the name of the attribute (i.e. the column name) to use for setting the colors
breaks = this option is used to create the ranges of values for each color. In this case, as in the example, I just created a sequence of values from the minimum to the maximum. As you can see from the code, I added 15 to the maximum value. This is because the number of breaks needs to have one more element than the number of colors. For example, if we set 10 breaks we would need to set 9 colors. For this reason, if the sequence of breaks ended before the maximum, the polygons with the maximum number of crimes would be presented in grey.
This is important!!

style.val = this option takes the color scale to be used to present the polygons. We can select one of the default scales or we can create a new one with the function color.scale in the package plotrix, which I already discussed here: Downloading and Visualizing Seismic Events from USGS

leg = this is simply the title of the legend
fill.alpha = the opacity of the colors in the map (ranges from 0 to 1, where 1 is the maximum)
lwd = the width of the line between polygons

After we set the style we can simply call the function leaflet to create the map:

leaflet(Borough.Leaflet,popup=c("name","Intensity","Intensity.Area"),style=map.style)

In this function we need to input the name of the GeoJSON object we created before, the style of the map and the names of the columns to use for the popups.
The result is the map shown below and available at this link: Leaflet Map



I must say this function is very neat. First of all, the function plotGoogleMaps, if you do not set the name of the HTML file, creates a series of temporary files stored in your temp folder, which is not great. Then, even if you set the name of the file, the legend is saved into different image files every time you call the function, which you may do many times until you are fully satisfied with the result.
The package leafletR, on the other hand, creates a new folder inside the working directory where it stores both the GeoJSON and the HTML file, and every time you modify the visualization the function overwrites the same files.
However, I noticed that I cannot see the map if I open the HTML files from my PC. I had to upload the file to my website every time I changed it to actually see the changes and how they affected the plot. This may be something related to my PC, however.


Density of Crimes in raster format
As you may remember from the previous post, one of the steps included in a point pattern analysis is the computation of the spatial density of the events. One of the techniques to do that is kernel density estimation, which basically calculates the density continuously across the study area, thus creating a raster.
We already looked at the kernel density in the previous post, so I will not go into details here; the code for computing the density and transforming it into a raster is the following:

Density <- density.ppp(Drugs.ppp, sigma = 500,edge=T,W=as.mask(window,eps=c(100,100)))
Density.raster <- raster(Density)
projection(Density.raster)=projection(GreaterLondonUTM)

The first line is basically the same as the one we used in the previous post. The only difference is that here I added the option W to set the resolution of the map with eps at 100x100 m.
Then I simply transformed the first object into a raster and assigned to it the same UTM projection as the object GreaterLondonUTM.
Now we can create the map. As far as I know (and from what I tested), leafletR is not yet able to plot raster objects, so the only way of doing it is again to use the function plotGoogleMaps:

plotGoogleMaps(Density.raster,filename="Crimes_Density.html",layerName="Number of Crimes", fillOpacity=0.4,strokeWeight=0,colPalette=rev(heat.colors(10)))

When we use this function to plot a raster we clearly do not need to specify the zcol option. Moreover, here I changed the default color scale through the colPalette argument to a reversed heat.colors, which I think is more appropriate for such a map. The result is the map below and at this link: Crime Density




Density of Crimes as contour lines
The raster presented above can also be represented as contour lines. The advantage of this type of visualization is that it is less intrusive, compared to a raster, and can also be better suited to pinpoint problematic locations.
Doing this in R is extremely simple, since there is a dedicated function in the package raster:

Contour <- rasterToContour(Density.raster,maxpixels=100000,nlevels=10)

This function transforms the raster above into a series of 10 contour lines (we can change the number of lines by changing the option nlevels).
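If specific break values are preferred over a fixed number of lines, rasterToContour should also accept explicit levels, since its extra arguments are passed on to contourLines (a sketch; the levels chosen here are arbitrary):

# pass explicit contour levels instead of nlevels
Contour.fixed <- rasterToContour(Density.raster, maxpixels=100000, levels=pretty(range(values(Density.raster), na.rm=TRUE), 10))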

Now we can plot these lines on an interactive web map. I first tested the use of plotGoogleMaps again, but I was surprised to see that for contour lines it does not seem to do a good job. I do not fully know the reason, but if I use the object Contour with this function it does not plot all the lines on the Google map, and therefore the visualization is useless.
For this reason I will present below the lines to plot contour lines using leafletR:

Contour.Leaflet <- toGeoJSON(Contour)
 
colour.scale <- color.scale(1:(length(Contour$level)-1),color.spec="rgb",extremes=c("red","blue"))
map.style <- styleGrad(pro="level",breaks=Contour$level,style.val=colour.scale,leg="Number of Crimes", lwd=2)
leaflet(Contour.Leaflet,style=map.style,base.map="tls")

As mentioned, the first thing to do in order to use leafletR is to transform our Spatial object into GeoJSON; the object Contour belongs to the class SpatialLinesDataFrame, so it is supported by the function toGeoJSON.
The next step is again to set the style of the map and then plot it. In this code I changed a few things just to show some more options. The first is the custom color scale I created using the function color.scale in the package plotrix. The only thing that the function styleGrad needs to set the colors in the option style.val is a vector of colors, which must be one element shorter than the vector used for the breaks. In this case the object Contour has only one property, namely "level", which is a vector of class factor. The function styleGrad can use it to create the breaks, but the function color.scale cannot use it to create the list of colors. We can work around this problem by setting the length of the color.scale vector using another vector: 1:(length(Contour$level)-1), which basically creates a vector of integers from 1 to the length of Contour$level minus one. The result of this function is a vector of colors ranging from red to blue, which we can plug into the following function.
In the function leaflet the only thing I changed is the base.map option, in which I use "tls". From the help page of the function we can see that the following options are available:

"One or a list of "osm" (OpenStreetMap standard map), "tls" (Thunderforest Landscape), "mqosm" (MapQuest OSM), "mqsat" (MapQuest Open Aerial),"water" (Stamen Watercolor), "toner" (Stamen Toner), "tonerbg" (Stamen Toner background), "tonerlite" (Stamen Toner lite), "positron" (CartoDB Positron) or "darkmatter" (CartoDB Dark matter). "

These lines create the following image, available as a webpage here: Contour






An R Enthusiast Goes Pythonic!


(This article was first published on Data Until I Die!, and kindly contributed to R-bloggers)

I’ve spent so many years using and broadcasting my love for R and using Python quite minimally. Having read recently about machine learning in Python, I decided to take on a fun little ML project using Python from start to finish.

What follows below takes advantage of a neat dataset from the UCI Machine Learning Repository.  The data contain the Math test performance of 395 students at 2 Portuguese schools.  What’s neat about this data set is that in addition to the grades on the students’ 3 Math tests, they managed to collect a whole whack of demographic (and some behavioural) variables as well.  That led me to the question of how well you can predict final math test performance based on demographics and behaviour alone.  In other words, who is likely to do well, and who is likely to tank?

I have to admit, before I continue, that I initially intended to do this analysis in Python alone, but I actually felt lost three quarters of the way through and just did the whole darned thing in R.  Once I had completed the analysis in R to my liking, I went back to my Python analysis and continued until I finished to my reasonable satisfaction.  For that reason, for each step in the analysis, I will show you the code I used in Python, the results, and then the same thing in R.  Do not treat this as a comparison of Python’s machine learning capabilities versus R per se.  Please treat this as a comparison of my understanding of how to do machine learning in Python versus R!

Without further ado, let’s start with some import statements in Python and library statements in R:

#Python Code
from pandas import *
from matplotlib import *
import seaborn as sns
sns.set_style("darkgrid")
import matplotlib.pyplot as plt
%matplotlib inline # I did this in ipython notebook, this makes the graphs show up inline in the notebook.
import statsmodels.formula.api as smf
from scipy import stats
from numpy.random import uniform
from numpy import arange
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
mat_perf = read_csv('/home/inkhorn/Student Performance/student-mat.csv', delimiter=';')

I’d like to comment on the number of import statements I found myself writing in this python script. Eleven!! Is that even normal? Note the smaller number of library statements in my R code block below:

#R Code
library(ggplot2)
library(dplyr)
library(ggthemr)
library(caret)
ggthemr('flat') # I love ggthemr!
mat_perf = read.csv('student-mat.csv', sep = ';')

Now let’s do a quick plot of our target variable, scores on the students’ final math test, named ‘G3′.

#Python Code
sns.set_palette("deep", desat=.6)
sns.set_context(context='poster', font_scale=1)
sns.set_context(rc={"figure.figsize": (8, 4)})
plt.hist(mat_perf.G3)
plt.xticks(range(0,22,2))

Distribution of Final Math Test Scores ("G3")
Python Hist - G3

That looks pretty pleasing to my eyes. Now let’s see the code for the same thing in R (I know, the visual theme is different. So sue me!)

#R Code
ggplot(mat_perf) + geom_histogram(aes(x=G3), binwidth=2)

Hist - G3

You’ll notice that I didn’t need to tweak any palette or font size parameters for the R plot, because I used the very fun ggthemr package. You choose the visual theme you want, declare it early on, and then all subsequent plots will share the same theme! There is a command I’ve hidden, however, modifying the figure height and width. I set the figure size using rmarkdown, otherwise I just would have sized it manually using the export menu in the plot frame in RStudio.  I think both plots look pretty nice, although I’m very partial to working with ggthemr!

Univariate estimates of variable importance for feature selection

Below, what I’ve done in both languages is to cycle through each variable in the dataset (excepting prior test scores) insert the variable name in a dictionary/list, and get a measure of importance of how predictive that variable is, alone, of the final math test score (variable G3). Of course if the variable is qualitative then I get an F score from an ANOVA, and if it’s quantitative then I get a t score from the regression.

In the case of Python this is achieved in both cases using the ols function from the statsmodels package. In the case of R I’ve achieved this using the aov function for qualitative variables and the lm function for quantitative variables. The numerical outcome, as you’ll see from the graphs, is the same.

#Python Code
test_stats = {'variable': [], 'test_type' : [], 'test_value' : []}

for col in mat_perf.columns[:-3]:
    test_stats['variable'].append(col)
    if mat_perf[col].dtype == 'O':
        # Do ANOVA
        aov = smf.ols(formula='G3 ~ C(' + col + ')', data=mat_perf, missing='drop').fit()
        test_stats['test_type'].append('F Test')
        test_stats['test_value'].append(round(aov.fvalue,2))
    else:
        # Do correlation
        print col + '\n'
        model = smf.ols(formula='G3 ~ ' + col, data=mat_perf, missing='drop').fit()
        value = round(model.tvalues[1],2)
        test_stats['test_type'].append('t Test')
        test_stats['test_value'].append(value)

test_stats = DataFrame(test_stats)
test_stats.sort(columns='test_value', ascending=False, inplace=True)
#R Code
test.stats = list(test.type = c(), test.value = c(), variable = c())

for (i in 1:30) {
  test.stats$variable[i] = names(mat_perf)[i]
  if (is.factor(mat_perf[,i])) {
    anova = summary(aov(G3 ~ mat_perf[,i], data=mat_perf))
    test.stats$test.type[i] = "F test"
    test.stats$test.value[i] = unlist(anova)[7]
  }
  else {
    reg = summary(lm(G3 ~ mat_perf[,i], data=mat_perf))
    test.stats$test.type[i] = "t test"
    test.stats$test.value[i] = reg$coefficients[2,3]
  }

}

test.stats.df = arrange(data.frame(test.stats), desc(test.value))
test.stats.df$variable = reorder(test.stats.df$variable, -test.stats.df$test.value)

And now for the graphs. Again you’ll see a bit more code for the Python graph vs the R graph. Perhaps someone will be able to show me code that doesn’t involve as many lines, or maybe it’s just the way things go with graphing in Python. Feel free to educate me :)

#Python Code
f, (ax1, ax2) = plt.subplots(2,1, figsize=(48,18), sharex=False)
sns.set_context(context='poster', font_scale=1)
sns.barplot(x='variable', y='test_value', data=test_stats.query("test_type == 'F Test'"), hline=.1, ax=ax1, x_order=[x for x in test_stats.query("test_type == 'F Test'")['variable']])
ax1.set_ylabel('F Values')
ax1.set_xlabel('')

sns.barplot(x='variable', y='test_value', data=test_stats.query("test_type == 't Test'"), hline=.1, ax=ax2, x_order=[x for x in test_stats.query("test_type == 't Test'")['variable']])
ax2.set_ylabel('t Values')
ax2.set_xlabel('')

sns.despine(bottom=True)
plt.tight_layout(h_pad=3)

Python Bar Plot - Univariate Estimates of Variable Importance

#R Code
ggplot(test.stats.df, aes(x=variable, y=test.value)) +
  geom_bar(stat="identity") +
  facet_grid(.~test.type ,  scales="free", space = "free") +
  theme(axis.text.x = element_text(angle = 45, vjust=.75, size=11))

Bar plot - Univariate Estimates of Variable Importance

As you can see, the estimates that I generated in both languages were thankfully the same. My next thought was to use only those variables with a test value (F or t) of 3.0 or higher. What you’ll see below is that this led to a pretty severe decrease in predictive power compared to being liberal with feature selection.

In reality, the feature selection I use below shouldn’t be necessary at all given the size of the data set vs the number of predictors, and the statistical method that I’m using to predict grades (random forest). What’s more, my feature selection method in fact led me to reject certain variables which I later found to be important in my expanded models! For this reason it would be nice to investigate a scalable multivariate feature selection method (I’ve been reading a bit about boruta but am skeptical about how well it scales up) to have in my tool belt. Enough blathering, and on with the model training:

Training the First Random Forest Model

#Python code
usevars =  [x for x in test_stats.query("test_value >= 3.0 | test_value <= -3.0")['variable']]
# note: numpy is not imported as np in the import block above, so build the column from a plain list
mat_perf['randu'] = [uniform(0,1) for x in range(0,mat_perf.shape[0])]

mp_X = mat_perf[usevars]
mp_X_train = mp_X[mat_perf['randu'] <= .67]
mp_X_test = mp_X[mat_perf['randu'] > .67]

mp_Y_train = mat_perf.G3[mat_perf['randu'] <= .67]
mp_Y_test = mat_perf.G3[mat_perf['randu'] > .67]

# for the training set
cat_cols = [x for x in mp_X_train.columns if mp_X_train[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_train[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_train = concat([mp_X_train, new_cols], axis=1)

# for the testing set
cat_cols = [x for x in mp_X_test.columns if mp_X_test[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_test[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_test = concat([mp_X_test, new_cols], axis=1)

mp_X_train.drop(cat_cols, inplace=True, axis=1)
mp_X_test.drop(cat_cols, inplace=True, axis=1)

rf = RandomForestRegressor(bootstrap=True,
           criterion='mse', max_depth=2, max_features='auto',
           min_density=None, min_samples_leaf=1, min_samples_split=2,
           n_estimators=100, n_jobs=1, oob_score=True, random_state=None,
           verbose=0)
rf.fit(mp_X_train, mp_Y_train)

After I got past the part where I constructed the training and testing sets (with “unimportant” variables filtered out) I ran into a real annoyance. I learned that categorical variables need to be converted to dummy variables before you do the modeling: each level of the categorical variable gets its own column of 1s and 0s, where 1 means the level was present in that row and 0 means it was not (so-called “one-hot encoding”). I suppose you could argue that this puts less computational demand on the modeling procedures, but when you’re dealing with tree based ensembles I think it’s a drawback. Let’s say you have a categorical variable with 5 levels, “a” through “e”. It just so happens that when you compare a split on that variable where “abc” is on one side and “de” is on the other, there is a very significant difference in the dependent variable. How is one-hot encoding going to capture that? And then, your dataset which had a certain number of columns now has 5 additional columns due to the encoding. “Blah” I say!
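
For illustration, here is a minimal R sketch of what one-hot encoding amounts to, using a hypothetical five-level factor (this is just base R’s model.matrix, not the pandas get_dummies used above):

#R Code
x <- factor(c("a", "b", "c", "d", "e", "a", "c"))   # hypothetical categorical variable
one_hot <- model.matrix(~ x - 1)                    # one 1/0 column per level (intercept dropped)
head(one_hot)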

Anyway, as you can see above, I used the get_dummies function in order to do the one-hot encoding. Also, you’ll see that I’ve assigned two thirds of the data to the training set and one third to the testing set. Now let’s see the same steps in R:

#R Code
keep.vars = match(filter(test.stats.df, abs(test.value) >= 3)$variable, names(mat_perf))
ctrl = trainControl(method="repeatedcv", number=10, selectionFunction = "oneSE")
mat_perf$randu = runif(395)
test = mat_perf[mat_perf$randu > .67,]
trf = train(mat_perf[mat_perf$randu <= .67,keep.vars], mat_perf$G3[mat_perf$randu <= .67],
            method="rf", metric="RMSE", data=mat_perf,
            trControl=ctrl, importance=TRUE)

Wait a minute. Did I really just train a Random Forest model in R, do cross validation, and prepare a testing data set with 5 commands!?!? That was a lot easier than doing these preparations and not doing cross validation in Python! I did in fact try to figure out cross validation in sklearn, but then I was having problems accessing variable importances after. I do like the caret package :) Next, let’s see how each of the models did on their testing set:
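
If you want to see what that cross-validation actually settled on, the object returned by train() keeps the resampling results; a quick, uncontroversial way to inspect the trf object built above is:

#R Code
print(trf)       # summary of the repeated 10-fold CV over the candidate mtry values
trf$results      # RMSE and R squared for each candidate value of mtry
trf$bestTune     # the mtry value selected under the "oneSE" rule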

Testing the First Random Forest Model

#Python Code
y_pred = rf.predict(mp_X_test)
sns.set_context(context='poster', font_scale=1)
first_test = DataFrame({"pred.G3.keepvars" : y_pred, "G3" : mp_Y_test})
sns.lmplot("G3", "pred.G3.keepvars", first_test, size=7, aspect=1.5)
print 'r squared value of', stats.pearsonr(mp_Y_test, y_pred)[0]**2
print 'RMSE of', sqrt(mean_squared_error(mp_Y_test, y_pred))

Python Scatter Plot - First Model Pred vs Actual

R^2 value of 0.104940038879
RMSE of 4.66552400292

Here, as in all cases when making a prediction using sklearn, I use the predict method to generate the predicted values from the model using the testing set and then plot the predictions (“pred.G3.keepvars”) vs the actual values (“G3”) using the lmplot function. I like the syntax that the lmplot function from the seaborn package uses, as it is simple and familiar to me from the R world (the arguments are essentially “X variable, Y variable, dataset name”, plus other aesthetic arguments). As you can see from the graph above and from the R^2 value, this model kind of sucks. Another thing I like here is the quality of the graph that seaborn outputs. It’s nice! It looks pretty modern, the text is very readable, and nothing looks edgy or pixelated in the plot. Okay, now let’s look at the code and output in R, using the same predictors.

#R Code
test$pred.G3.keepvars = predict(trf, test, "raw")
cor.test(test$G3, test$pred.G3.keepvars)$estimate[[1]]^2
summary(lm(test$G3 ~ test$pred.G3.keepvars))$sigma
ggplot(test, aes(x=G3, y=pred.G3.keepvars)) + geom_point() + stat_smooth(method="lm") + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))

Scatter Plot - First Model Pred vs Actual

R^2 value of 0.198648
RMSE of 4.148194

Well, it looks like this model sucks a bit less than the Python one. Quality-wise, the plot looks super nice (thanks again, ggplot2 and ggthemr!) although by default the alpha parameter is not set to account for overplotting. The docs page for ggplot2 suggests setting alpha=.05, but for this particular data set, setting it to .5 seems to be better.
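
For example, to thin out the overplotted points you could tweak the earlier call along these lines (the alpha value is just chosen by eye):

#R Code
ggplot(test, aes(x=G3, y=pred.G3.keepvars)) +
  geom_point(alpha=.5) +
  stat_smooth(method="lm") +
  scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))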

Finally for this section, let’s look at the variable importances generated for each training model:

#Python Code
importances = DataFrame({'cols':mp_X_train.columns, 'imps':rf.feature_importances_})
print importances.sort(['imps'], ascending=False)

             cols      imps
3        failures  0.641898
0            Medu  0.064586
10          sex_F  0.043548
19  Mjob_services  0.038347
11          sex_M  0.036798
16   Mjob_at_home  0.036609
2             age  0.032722
1            Fedu  0.029266
15   internet_yes  0.016545
6     romantic_no  0.013024
7    romantic_yes  0.011134
5      higher_yes  0.010598
14    internet_no  0.007603
4       higher_no  0.007431
12        paid_no  0.002508
20   Mjob_teacher  0.002476
13       paid_yes  0.002006
18     Mjob_other  0.001654
17    Mjob_health  0.000515
8       address_R  0.000403
9       address_U  0.000330
#R Code
varImp(trf)

## rf variable importance
## 
##          Overall
## failures 100.000
## romantic  49.247
## higher    27.066
## age       17.799
## Medu      14.941
## internet  12.655
## sex        8.012
## Fedu       7.536
## Mjob       5.883
## paid       1.563
## address    0.000

My first observation is that it was obviously easier for me to get the variable importances in R than it was in Python. Next, you’ll certainly see the symptom of the dummy coding I had to do for the categorical variables. That’s no fun, but we’ll survive through this example analysis, right? Now let’s look at which variables made it to the top:

Whereas failures, mother’s education level, sex and mother’s job made it to the top of the list for the Python model, the R model’s top 4 were different, with only failures in common.
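
One way to make the two lists more comparable would be to roll the dummy-level importances back up to their parent variables before comparing. A rough sketch, assuming the Python importances were exported to a hypothetical data frame py_imps with columns cols and imps (the underscore-splitting rule is a simplification):

#R Code
py_imps$parent <- sub("_.*$", "", py_imps$cols)            # e.g. "Mjob_services" -> "Mjob"
rolled_up <- aggregate(imps ~ parent, data = py_imps, FUN = sum)
rolled_up[order(-rolled_up$imps), ]                        # summed importance per original variable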

With the understanding that the variable selection method that I used was inappropriate, let’s move on to training a Random Forest model using all predictors except the prior 2 test scores. Since I’ve already commented above on my thoughts about the various steps in the process, I’ll comment only on the differences in results in the remaining sections.

Training and Testing the Second Random Forest Model

#Python Code

#aav = almost all variables
mp_X_aav = mat_perf[mat_perf.columns[0:30]]
mp_X_train_aav = mp_X_aav[mat_perf['randu'] <= .67]
mp_X_test_aav = mp_X_aav[mat_perf['randu'] > .67]

# for the training set
cat_cols = [x for x in mp_X_train_aav.columns if mp_X_train_aav[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_train_aav[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_train_aav = concat([mp_X_train_aav, new_cols], axis=1)
    
# for the testing set
cat_cols = [x for x in mp_X_test_aav.columns if mp_X_test_aav[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_test_aav[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_test_aav = concat([mp_X_test_aav, new_cols], axis=1)

mp_X_train_aav.drop(cat_cols, inplace=True, axis=1)
mp_X_test_aav.drop(cat_cols, inplace=True, axis=1)

rf_aav = RandomForestRegressor(bootstrap=True, 
           criterion='mse', max_depth=2, max_features='auto',
           min_density=None, min_samples_leaf=1, min_samples_split=2,
           n_estimators=100, n_jobs=1, oob_score=True, random_state=None,
           verbose=0)
rf_aav.fit(mp_X_train_aav, mp_Y_train)

y_pred_aav = rf_aav.predict(mp_X_test_aav)
second_test = DataFrame({"pred.G3.almostallvars" : y_pred_aav, "G3" : mp_Y_test})
sns.lmplot("G3", "pred.G3.almostallvars", second_test, size=7, aspect=1.5)
print 'r squared value of', stats.pearsonr(mp_Y_test, y_pred_aav)[0]**2
print 'RMSE of', sqrt(mean_squared_error(mp_Y_test, y_pred_aav))

Python Scatter Plot - Second Model Pred vs Actual

R^2 value of 0.226587731888
RMSE of 4.3338674965

Compared to the first Python model, the R^2 on this one is more than double (the first R^2 was .10494) and the RMSE is 7.1% lower (the first was 4.6655). The predicted vs actual plot confirms that the predictions still don’t look fantastic compared to the actuals, which is probably the main reason why the RMSE hasn’t decreased all that much. Now to the R code using the same predictors:

#R code
trf2 = train(mat_perf[mat_perf$randu <= .67,1:30], mat_perf$G3[mat_perf$randu <= .67],
            method="rf", metric="RMSE", data=mat_perf,
            trControl=ctrl, importance=TRUE)
test$pred.g3.almostallvars = predict(trf2, test, "raw")
cor.test(test$G3, test$pred.g3.almostallvars)$estimate[[1]]^2
summary(lm(test$G3 ~ test$pred.g3.almostallvars))$sigma
ggplot(test, aes(x=G3, y=pred.g3.almostallvars)) + geom_point() + 
  stat_smooth() + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))

Scatter Plot - Second Model Pred vs Actual

R^2 value of 0.3262093
RMSE of 3.8037318

Compared to the first R model, the R^2 on this one is approximately 1.64 times as high (the first R^2 was .19865) and the RMSE is 8.3% lower (the first was 4.148194). Although this particular model is indeed doing better at predicting values in the test set than the one built in Python using the same variables, I would still hesitate to assume that the process is inherently better for this data set. Due to the randomness inherent in Random Forests, one run of the training could be lucky enough to give results like the above, whereas other times the results might even be slightly worse than what I managed to get in Python. I confirmed this; in fact most additional runs of this model in R seemed to result in an R^2 of around .20 and an RMSE of around 4.2.
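
To put a number on that run-to-run variability, you could wrap the training in a small loop over seeds; a quick sketch reusing the ctrl and test objects from above (and accepting that it takes a few minutes to run):

#R Code
r2_runs <- sapply(1:5, function(s) {
  set.seed(s)
  fit <- train(mat_perf[mat_perf$randu <= .67, 1:30], mat_perf$G3[mat_perf$randu <= .67],
               method="rf", metric="RMSE", trControl=ctrl)
  preds <- predict(fit, test)
  cor(test$G3, preds)^2              # test-set R^2 for this run
})
summary(r2_runs)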

Again, let’s look at the variable importances generated for each training model:

#Python Code
importances_aav = DataFrame({'cols':mp_X_train_aav.columns, 'imps':rf_aav.feature_importances_})
print importances_aav.sort(['imps'], ascending=False)

                 cols      imps
5            failures  0.629985
12           absences  0.057430
1                Medu  0.037081
41      schoolsup_yes  0.036830
0                 age  0.029672
23       Mjob_at_home  0.029642
16              sex_M  0.026949
15              sex_F  0.026052
40       schoolsup_no  0.019097
26      Mjob_services  0.016354
55       romantic_yes  0.014043
51         higher_yes  0.012367
2                Fedu  0.011016
39     guardian_other  0.010715
37    guardian_father  0.006785
8               goout  0.006040
11             health  0.005051
54        romantic_no  0.004113
7            freetime  0.003702
3          traveltime  0.003341
#R Code
varImp(trf2)

## rf variable importance
## 
##   only 20 most important variables shown (out of 30)
## 
##            Overall
## absences    100.00
## failures     70.49
## schoolsup    47.01
## romantic     32.20
## Pstatus      27.39
## goout        26.32
## higher       25.76
## reason       24.02
## guardian     22.32
## address      21.88
## Fedu         20.38
## school       20.07
## traveltime   20.02
## studytime    18.73
## health       18.21
## Mjob         17.29
## paid         15.67
## Dalc         14.93
## activities   13.67
## freetime     12.11

Now in both cases we’re seeing that absences and failures are considered the top 2 most important variables for predicting the final math exam grade. It makes sense to me, but frankly it’s a little sad that the two most important variables are so negative :( On to the third Random Forest model, containing everything from the second with the addition of the students’ marks on their second math exam!

Training and Testing the Third Random Forest Model

#Python Code

#allv = all variables (except G1)
allvars = range(0,30)
allvars.append(31)
mp_X_allv = mat_perf[mat_perf.columns[allvars]]
mp_X_train_allv = mp_X_allv[mat_perf['randu'] <= .67]
mp_X_test_allv = mp_X_allv[mat_perf['randu'] > .67]

# for the training set
cat_cols = [x for x in mp_X_train_allv.columns if mp_X_train_allv[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_train_allv[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_train_allv = concat([mp_X_train_allv, new_cols], axis=1)
    
# for the testing set
cat_cols = [x for x in mp_X_test_allv.columns if mp_X_test_allv[x].dtype == "O"]
for col in cat_cols:
    new_cols = get_dummies(mp_X_test_allv[col])
    new_cols.columns = col + '_' + new_cols.columns
    mp_X_test_allv = concat([mp_X_test_allv, new_cols], axis=1)

mp_X_train_allv.drop(cat_cols, inplace=True, axis=1)
mp_X_test_allv.drop(cat_cols, inplace=True, axis=1)

rf_allv = RandomForestRegressor(bootstrap=True, 
           criterion='mse', max_depth=2, max_features='auto',
           min_density=None, min_samples_leaf=1, min_samples_split=2,
           n_estimators=100, n_jobs=1, oob_score=True, random_state=None,
           verbose=0)
rf_allv.fit(mp_X_train_allv, mp_Y_train)

y_pred_allv = rf_allv.predict(mp_X_test_allv)
third_test = DataFrame({"pred.G3.plusG2" : y_pred_allv, "G3" : mp_Y_test})
sns.lmplot("G3", "pred.G3.plusG2", third_test, size=7, aspect=1.5)
print 'r squared value of', stats.pearsonr(mp_Y_test, y_pred_allv)[0]**2
print 'RMSE of', sqrt(mean_squared_error(mp_Y_test, y_pred_allv))

Python Scatter Plot - Third Model Pred vs Actual

R^2 value of 0.836089929903
RMSE of 2.11895794845

Obviously we have added a highly predictive piece of information here by adding the grades from their second math exam (variable name “G2”). I was reluctant to add this variable at first because when you predict test marks with previous test marks, the model stops being useful earlier in the year, before those tests have been administered. However, I did want to see what the model would look like when I included it anyway! Now let’s see how predictive these variables were when I put them into a model in R:

#R Code
trf3 = train(mat_perf[mat_perf$randu <= .67,c(1:30,32)], mat_perf$G3[mat_perf$randu <= .67], 
             method="rf", metric="RMSE", data=mat_perf, 
             trControl=ctrl, importance=TRUE)
test$pred.g3.plusG2 = predict(trf3, test, "raw")
cor.test(test$G3, test$pred.g3.plusG2)$estimate[[1]]^2
summary(lm(test$G3 ~ test$pred.g3.plusG2))$sigma
ggplot(test, aes(x=G3, y=pred.g3.plusG2)) + geom_point() + 
  stat_smooth(method="lm") + scale_y_continuous(breaks=seq(0,20,4), limits=c(0,20))

Scatter Plot - Third Model Pred vs Actual

R^2 value of 0.9170506
RMSE of 1.3346087

Well, it appears that yet again we have a case where the R model has fared better than the Python model. I find it notable that when you look at the scatterplot for the Python model you can see what look like steps in the points as you scan your eyes from the bottom-left part of the trend line to the top-right part. It appears that the Random Forest model in R has benefitted from the tuning process; as a result the residuals are more homoscedastic and also obviously closer to the regression line than in the Python model. I still wonder how much more similar these results would be if I had carried out the Python analysis by tuning while cross validating like I did in R!
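
For reference, this is roughly what caret is doing behind the scenes; you could also make the tuning explicit by handing train() a grid of mtry values (the values below are arbitrary illustrations):

#R Code
rf_grid <- expand.grid(mtry = c(2, 5, 10, 20))
trf3_tuned <- train(mat_perf[mat_perf$randu <= .67, c(1:30, 32)],
                    mat_perf$G3[mat_perf$randu <= .67],
                    method="rf", metric="RMSE",
                    trControl=ctrl, tuneGrid=rf_grid, importance=TRUE)
trf3_tuned$bestTune                  # the mtry value that won the cross-validation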

For the last time, let’s look at the variable importances generated for each training model:

#Python Code
importances_allv = DataFrame({'cols':mp_X_train_allv.columns, 'imps':rf_allv.feature_importances_})
print importances_allv.sort(['imps'], ascending=False)

                 cols      imps
13                 G2  0.924166
12           absences  0.075834
14          school_GP  0.000000
25        Mjob_health  0.000000
24       Mjob_at_home  0.000000
23          Pstatus_T  0.000000
22          Pstatus_A  0.000000
21        famsize_LE3  0.000000
20        famsize_GT3  0.000000
19          address_U  0.000000
18          address_R  0.000000
17              sex_M  0.000000
16              sex_F  0.000000
15          school_MS  0.000000
56       romantic_yes  0.000000
27      Mjob_services  0.000000
11             health  0.000000
10               Walc  0.000000
9                Dalc  0.000000
8               goout  0.000000
#R Code
varImp(trf3)

## rf variable importance
## 
##   only 20 most important variables shown (out of 31)
## 
##            Overall
## G2         100.000
## absences    33.092
## failures     9.702
## age          8.467
## paid         7.591
## schoolsup    7.385
## Pstatus      6.604
## studytime    5.963
## famrel       5.719
## reason       5.630
## guardian     5.278
## Mjob         5.163
## school       4.905
## activities   4.532
## romantic     4.336
## famsup       4.335
## traveltime   4.173
## Medu         3.540
## Walc         3.278
## higher       3.246

Now this is VERY telling, and gives me insight as to why the scatterplot from the Python model had that staircase quality to it. The R model is taking into account way more variables than the Python model. G2 obviously takes the cake in both models, but I suppose it overshadowed everything else by so much in the Python model that it just didn’t find any use for any variable other than absences. (It may also have something to do with the max_depth=2 I set on the Python forests, which leaves room for only a couple of splits per tree.)

Conclusion

This was fun! For all the work I did in Python, I used IPython Notebook. Being an avid RStudio user, I’m not used to web-browser based interactive coding like what IPython Notebook provides. I discovered that I enjoy it and found it useful for laying out the information that I was using to write this blog post (I also laid out the R part of this analysis in RMarkdown for that same reason). What I did not like about IPython Notebook is that when you close it/shut it down/then later reinitialize it, all of the objects that form your data and analysis are gone and all you have left are the results. You must then re-run all of your code so that your objects are resident in memory again. It would be nice to have some kind of convenience function to save everything to disk so that you can reload at a later time.
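
For what it’s worth, R does have that convenience built in: the whole workspace can be written to disk and reloaded in a later session.

#R Code
save.image("analysis_workspace.RData")   # snapshot every object in the current session
# ...later, in a fresh R session...
load("analysis_workspace.RData")         # everything is back in memory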

I found myself stumbling a lot trying to figure out which Python packages to use for each particular purpose and I tended to get easily frustrated. I had to keep reminding myself that it’s a learning curve to a similar extent as it was for me while I was learning R. This frustration should not be a deterrent from picking it up and learning how to do machine learning in Python. Another part of my frustration was not being able to get variable importances from my Random Forest models in Python when I was building them using cross validation and grid searches. If you have a link to share with me that shows an example of this, I’d be happy to read it.

I liked seaborn and I think if I spend more time with it then perhaps it could serve as a decent alternative to graphing in ggplot2. That being said, I’ve spent so much time using ggplot2 that sometimes I wonder if there is anything out there that rivals its flexibility and elegance!

The issue I mentioned above with categorical variables is annoying and it really makes me wonder if using a Tree based R model would intrinsically be superior due to its automatic handling of categorical variables compared with Python, where you need to one-hot encode these variables.

All in all, I hope this was as useful and educational for you as it was for me. It’s important to step outside of your comfort zone every once in a while :)


To leave a comment for the author, please follow the link and comment on his blog: Data Until I Die!.


RevoScaleR’s Naive Bayes Classifier rxNaiveBayes()


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert,

Because of its simplicity and good performance over a wide spectrum of classification problems the Naïve Bayes classifier ought to be on everyone's short list of machine learning algorithms. Now, with version 7.4 we have a high performance Naïve Bayes classifier in Revolution R Enterprise too. Like all Parallel External Memory Algorithms (PEMAs) in the RevoScaleR package, rxNaiveBayes is an inherently parallel algorithm that may be distributed across Microsoft HPC, Linux and Hadoop clusters and may be run on data in Teradata databases.

The following example shows how to get started with rxNaiveBayes() on a moderately sized data set in your local environment. It uses the Mortgage data set, which may be downloaded from the Revolution Analytics data set repository. The first block of code imports the .csv files for the years 2000 through 2008 and concatenates them into a single training file in the .XDF binary format. Then, the data for the year 2009 is imported to a test file that will be used for making predictions.

#-----------------------------------------------
# Set up the data location information
bigDataDir <- "C:/Data/Mortgage" 
mortCsvDataName <- file.path(bigDataDir,"mortDefault") 
trainingDataFileName <- "mortDefaultTraining" 
mortCsv2009 <- paste(mortCsvDataName, "2009.csv", sep = "") 
targetDataFileName <- "mortDefault2009.xdf"
 
 
#--------------------------------------- 
# Import the data from multiple .csv files into 2 .XDF files
# One file, the training file containing data from the years
# 2000 through 2008.
# The other file, the test file, containing data from the year 2009.
 
defaultLevels <- as.character(c(0,1)) 
ageLevels <- as.character(c(0:40)) 
yearLevels <- as.character(c(2000:2009)) 
colInfo <- list(list(name  = "default", type = "factor", levels = defaultLevels), 
	       list(name = "houseAge", type = "factor", levels = ageLevels), 
		   list(name = "year", type = "factor", levels = yearLevels)) 
 
append <- FALSE
for (i in 2000:2008) {
    importFile <- paste(mortCsvDataName, i, ".csv", sep = "")
    rxImport(inData = importFile, outFile = trainingDataFileName,
             colInfo = colInfo, append = append, overwrite = TRUE)
    append <- TRUE
}

The rxGetInfo() command shows that the training file has 9 million observations with 6 variables and the test file contains 1 million observations. The binary factor variable, default, which indicates whether or not an individual defaulted on the mortgage, will be the target variable in the classification exercise.

rxGetInfo(trainingDataFileName, getVarInfo=TRUE)
#File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefaultTraining.xdf 
#Number of observations: 9e+06 
#Number of variables: 6 
#Number of blocks: 18 
#Compression type: zlib 
#Variable information: 
#Var 1: creditScore, Type: integer, Low/High: (432, 955)
#Var 2: houseAge
       #41 factor levels: 0 1 2 3 4 ... 36 37 38 39 40
#Var 3: yearsEmploy, Type: integer, Low/High: (0, 15)
#Var 4: ccDebt, Type: integer, Low/High: (0, 15566)
#Var 5: year
       #10 factor levels: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
#Var 6: default
       #2 factor levels: 0 1
 
rxImport(inData = mortCsv2009, outFile = targetDataFileName, colInfo = colInfo)
rxGetInfo(targetDataFileName)
#> rxGetInfo(targetDataFileName)
#File name: C:\Users\jrickert\Documents\Revolution\NaiveBayes\mortDefault2009.xdf 
#Number of observations: 1e+06 
#Number of variables: 6 
#Number of blocks: 2 
#Compression type: zlib 

Next, the rxNaiveBayes() function is used to fit a classification model with default as the target variable and year, credit score, years employed and credit card debt as predictors. Note that the smoothingFactor parameter instructs the classifier to perform Laplace smoothing. (Since the conditional probabilities are being multiplied in the model, adding a small number to 0 probabilities precludes missing categories from wiping out the calculation.) Also note that it took about 1.9 seconds to fit the model on my modest Lenovo ThinkPad, which is powered by an Intel i7 5600U processor and equipped with 8GB of RAM.
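
As a quick aside before looking at the fitted model, here is a minimal sketch in plain R (not RevoScaleR) of what Laplace smoothing with a smoothing factor of 1 does to a conditional probability table; the counts are invented for illustration:

counts <- c(cat_A = 83, cat_B = 597, cat_C = 0)   # hypothetical within-class category counts
counts / sum(counts)                              # raw estimate: cat_C gets probability 0
(counts + 1) / (sum(counts) + length(counts))     # smoothed: no category can zero out the product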

# Build the classifier on the training data
mortNB <- rxNaiveBayes(default ~ year + creditScore + yearsEmploy + ccDebt,  
	                   data = trainingDataFileName, smoothingFactor = 1) 
 
 
#Rows Read: 500000, Total Rows Processed: 8500000, Total Chunk Time: 0.110 seconds
#Rows Read: 500000, Total Rows Processed: 9000000, Total Chunk Time: 0.125 seconds 
#Computation time: 1.875 seconds.		

Looking at the model object we see that conditional probabilities are calculated for all of the factor (categorical) variables and means and standard deviations are calculated for numeric variables. rxNaiveBayes() follows the standard practice of assuming that these variables follow Gaussian distributions.
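
To make the Gaussian assumption concrete: for a numeric predictor, the class-conditional likelihood is just a normal density evaluated at the observed value, using the per-class mean and standard deviation reported below. A hand-rolled sketch in plain R, borrowing the ccDebt figures from the output:

ccDebt_new <- 9000                                   # hypothetical credit card debt for a new record
dnorm(ccDebt_new, mean = 4991.582, sd = 1976.716)    # likelihood under default = 0
dnorm(ccDebt_new, mean = 9349.423, sd = 1459.797)    # likelihood under default = 1
# the classifier multiplies such terms (and the priors) across predictors for each class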

#> mortNB 
#
#Naive Bayes Classifier
#
#Call:
#rxNaiveBayes(formula = default ~ year + creditScore + yearsEmploy + 
    #ccDebt, data = trainingDataFileName, smoothingFactor = 1)
#
#A priori probabilities:
#default
          #0           1 
#0.997242889 0.002757111 
#
#Predictor types:
     #Variable    Type
#1        year  factor
#2 creditScore numeric
#3 yearsEmploy numeric
#4      ccDebt numeric
#
#Conditional probabilities:
#$year
       #year
#default         2000         2001         2002         2003         2004
      #0 1.113034e-01 1.110692e-01 1.112866e-01 1.113183e-01 1.113589e-01
      #1 4.157267e-02 1.262488e-01 4.765549e-02 3.617467e-02 2.151144e-02
       #year
#default         2005         2006         2007         2008         2009
      #0 1.113663e-01 1.113403e-01 1.111888e-01 1.097681e-01 1.114182e-07
      #1 1.885272e-02 2.823880e-02 8.302449e-02 5.966806e-01 4.028360e-05
#
#$creditScore
     #Means   StdDev
#0 700.0839 50.00289
#1 686.5243 49.71074
#
#$yearsEmploy
     #Means   StdDev
#0 5.006873 2.009446
#1 4.133030 1.969213
#
#$ccDebt
     #Means   StdDev
#0 4991.582 1976.716
#1 9349.423 1459.797				
 

Next, we use the rxPredict() function to predict default values for the test data set. Setting the type = "prob" parameter produced the table of probabilities below. Using the default for type would have produced only the default_Pred column of forecasts. In a multi-value forecast, the probability table would contain entries for all possible values.

# use the model to predict whether a loan will default on the test data
mortNBPred <- rxPredict(mortNB, data = targetDataFileName, type="prob") 
#Rows Read: 500000, Total Rows Processed: 500000, Total Chunk Time: 3.876 seconds
#Rows Read: 500000, Total Rows Processed: 1000000, Total Chunk Time: 2.280 seconds 
names(mortNBPred) <- c("prob_0","prob_1")
mortNBPred$default_Pred <- as.factor(round(mortNBPred$prob_1))
 
#head(mortNBPred)
     #prob_0      prob_1 default_Pred
#1 0.9968860 0.003114038            0
#2 0.9569425 0.043057472            0
#3 0.5725627 0.427437291            0
#4 0.9989603 0.001039729            0
#5 0.7372746 0.262725382            0
#6 0.4142266 0.585773432            1

In this next step, we tabulate the actual vs. predicted values for the test data set to produce the "confusion matrix" and an estimate of the misclassification rate.

# Tabulate the actual and predicted values
actual_value <- rxDataStep(targetDataFileName,maxRowsByCols=6000000)[["default"]]
predicted_value <- mortNBPred[["default_Pred"]]
results <- table(predicted_value,actual_value) 
#> results
               #actual_value
#predicted_value      0      1
              #0 877272   3792
              #1  97987  20949
 
pctMisclassified <- sum(results[1:2,2])/sum(results[1:2,1])*100 
pctMisclassified 
#[1] 2.536865
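
Note that the ratio above works out to the number of actual defaults divided by the number of actual non-defaults in the test set. The overall misclassification rate in the usual sense (off-diagonal counts over all test cases, i.e. one minus the accuracy that confusionMatrix() reports below) can be computed directly from the table:

# false negatives plus false positives, over all 1,000,000 test rows
overallMisclassified <- (results["0","1"] + results["1","0"])/sum(results)*100
overallMisclassified
# roughly 10.2, consistent with the 0.8982 accuracy shown below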

Since the results object produced above is an ordinary table we can use the confusionMatrix() from the caret package to produce additional performance measures.

# Use confusionMatrix from the caret package to look at the results
library(caret)
library(e1071)
confusionMatrix(results,positive="1")
 
#Confusion Matrix and Statistics
#
               #actual_value
#predicted_value      0      1
              #0 877272   3792
              #1  97987  20949
                                          #
               #Accuracy : 0.8982          
                 #95% CI : (0.8976, 0.8988)
    #No Information Rate : 0.9753          
    #P-Value [Acc > NIR] : 1               
                                          #
                  #Kappa : NA              
 #Mcnemar's Test P-Value : <2e-16          
                                          #
            #Sensitivity : 0.84673         
            #Specificity : 0.89953         
         #Pos Pred Value : 0.17614         
         #Neg Pred Value : 0.99570         
             #Prevalence : 0.02474         
         #Detection Rate : 0.02095         
   #Detection Prevalence : 0.11894         
      #Balanced Accuracy : 0.87313         
                                          #
       #'Positive' Class : 1               

Finally, we use the hist() function to look at a histogram (not shown) of the actual values to get a feel for how unbalanced the data set is, and then use the rxRocCurve() function to produce the ROC curve.

roc_data <- data.frame(mortNBPred$prob_1,as.integer(actual_value)-1)
names(roc_data) <- c("predicted_value","actual_value")
head(roc_data)
hist(roc_data$actual_value)
rxRocCurve("actual_value","predicted_value",roc_data,title="ROC Curve for Naive Bayes Mortgage Defaults Model")

 

ROC Curve for Naive Bayes Mortgage Defaults Model

Here we have a "picture-perfect" representation of how one hopes a classifier will perform.

For more on the Naïve Bayes classification algorithm have a look at these two papers referenced in the Wikipedia link above.

  1. Minsky paper
  2. Domingos and Pazzani paper

The first is a prescient 1961 paper by Marvin Minsky that explicitly calls attention to the naïve independence assumption. The second paper provides some theoretical arguments for why the overall excellent performance of the Naïve Bayes classifier is not accidental.

 

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


R: the Excel Connection


(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

By Andy Nicholls, Head of Consulting

As companies increasingly look beyond the scope of what is logistically possible in Excel, more and more of them are approaching Mango looking for help with connecting to Excel from R. With over 6,500 packages now on CRAN it should come as no surprise that there are quite a few packages that have been written in order to connect to Excel from R. So which is the best? Unfortunately it really does depend on what you want to do, but here’s a quick guide to some of the main packages available from CRAN.

The “All-Rounders”

Overview

There are four “all-rounder” packages that can both read and write to Excel: XLConnect, written and maintained by Mirai Solutions; xlsx by Adrian Dragulescu; openxlsx by Alexander Walker; and excel.link by Gregory Demin. With each of these packages it is possible to follow a structured workflow in which a user connects to a workbook, extracts data, processes the data in R and then writes data, graphics or even Excel formulae back to the workbook. It is possible to add new sheets, recolour, add formulae and so on. There are, however, some important differences that users should be aware of.
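
To give a flavour of that workflow, here is a minimal XLConnect sketch (the file, sheet and column names are hypothetical):

library(XLConnect)

wb <- loadWorkbook("results.xlsx", create = TRUE)   # connect to (or create) a workbook
dat <- readWorksheet(wb, sheet = "inputs")          # pull a sheet into a data frame

dat$total <- dat$price * dat$quantity               # process the data in R

createSheet(wb, "outputs")
writeWorksheet(wb, dat, sheet = "outputs")          # write the results back
saveWorkbook(wb)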

Formal Comparison of “All-Rounders”

At Mango we regularly run tests to track developments around these packages and assess their suitability for various projects. The test is fairly straightforward and simply involves connecting to a 4MB file, reading in some different data formats and writing back data and graphics to the spreadsheet. The results are shown in the table below.

Comparison table of the four “all-rounder” packages

 

General Observations

For the general consumer I personally think XLConnect is the best all-rounder, though functionally there’s not much difference between it and xlsx. Both XLConnect and xlsx depend on rJava, but the Java elements are hidden away from the user with XLConnect, and primarily for that reason I have a slight personal preference for it. If your workbook is full of “=sum()” or other formulae then it is also worth noting that XLConnect will read in the result of the calculation, whereas xlsx interprets the formula as NA. I am not a big fan of the XLConnect header formatting when writing data to Excel, however. If you’re an xlsx pro user there’s certainly no reason that I can think of to drop the package and switch to XLConnect.

One potential downside of both XLConnect and xlsx is their Java dependency. Users with medium to large workbooks may soon encounter “Java heap space” error messages as memory is consumed in huge quantities. Those working for companies where IT have ultimate rule over your laptop may also find it difficult to get set up in the first place, since Java must be installed and available in the right location. This is essentially a binary hurdle that you can either overcome or you can’t. If you can, it’s worth continuing as these are great packages.

openxlsx is fairly new and a package that I hadn’t tried until recently. I have to say I’ve been impressed with what I’ve seen thus far. Crucially it does not have a dependency on Java, though Windows users will need a zip application (e.g. Rtools). The functionality and consistency is not quite at the level of XLConnect or xlsx yet. For example the read.xlsx function has a startRow argument but no startCol; you can use the argument cols to specify start and end columns, but it feels more like a workaround in my opinion. Dates are read in as numeric by default as well, which most users will find frustrating (unlike XLConnect or xlsx, which convert dates to POSIXct). Plus it’s noticeably slower than the aforementioned packages for large workbooks. Despite all of that, however, it is the only all-rounder package that I’ve used that can easily connect to and import a protected sheet from a large “xlsm” test file that we have at Mango, pumped full of VBA. And for what it’s worth it is also the only all-rounder that interprets a “#NUM” in Excel as NaN in R; XLConnect and xlsx read such values as NA whilst excel.link fails with an error. Graphics may also be written directly to Excel without having to generate an intermediate file (though often an intermediate file can be a useful output).
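
A rough openxlsx sketch of the points above (file and column ranges are hypothetical; if I recall correctly, the detectDates argument is the way around the dates-as-numeric default):

library(openxlsx)

# cols stands in for the missing startCol; detectDates avoids dates arriving as numbers
dat <- read.xlsx("results.xlsx", sheet = 1, startRow = 2,
                 cols = 2:6, detectDates = TRUE)

write.xlsx(dat, "outputs.xlsx")   # no Java needed, but Windows users need a zip tool such as Rtools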

That leaves excel.link. I left this one until last as it’s very different from the other three all-rounder packages. excel.link uses RDCOMClient, and any edits to a workbook are live edits in an open workbook, making it easier to develop a script. It also passes Mango’s “xlsm” test, albeit only after the protected sheet is unprotected. However it’s quite tough for the lay-user to pick up, and for some reason it doesn’t appear to be able to read hidden sheets. Speed is also an issue that users will notice if their script is particularly long and the workbook large. That said, it certainly offers something different to the other packages, and if you learn it well then it’s a powerful ally.

Reading Structured Data from Excel

The all-rounders are great but if you are fortunate enough to have structured data, which in this context means your datasets are stored in the top left-hand corner of separate tabs, then there are a few other options to consider which may be much faster than the ‘all-rounder’ packages for reading in data. The multi-purpose RODBC is very mature and really easy to use, but some users can be limited by their driver set-up. RJDBC is a viable, albeit slower, alternative that has its own (Java) restriction. Hadley Wickham has a habit of finding problems that need solving, and for those who are struggling with either of these packages, readxl is the new kid on the block for structured data that everyone will probably be using by the end of the year.
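
Reading a structured sheet with readxl is about as short as it gets; a minimal sketch (file and sheet names are hypothetical):

library(readxl)

dat <- read_excel("results.xlsx", sheet = "inputs")   # no Java and no drivers required
str(dat)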

What Else is There?

There are further specific packages available, such as Guido van Steen’s dataframes2xls package, which uses Python’s pyexcelerator to write to xls, and Marc Schwartz’s WriteXLS, which uses Perl to write to xls and xlsx files. Another Perl implementation is Gregory Warnes’s gdata, which can be used to read data from Excel. And the list goes on, but I have to stop writing at some point!

Conclusion

If you’ve found a good package stick to it! If you’re starting out it’s worth considering the structure of your data and the end users of your code. Are they going to have all the freedom you have to configure Java, install drivers and so on?  There are plenty of packages available and most of them are pretty good so long as you understand their limitations.

 

To leave a comment for the author, please follow the link and comment on his blog: Mango Solutions.


Modeling Contagion Using Airline Networks in R


(This article was first published on Stable Markets » R, and kindly contributed to R-bloggers)

I first became interested in networks when reading Matthew O. Jackson’s 2010 paper describing networks in economics. At some point during the 2014 Ebola outbreak, I became interested in how the disease could actually come to the U.S. I was caught up with work/classes at the time, but decided to use airline flight data to at least explore the question.

This is the same dataset I used here. The datasource is given in that post.

I assumed that the disease had a single origin (Liberia) and wanted to explore the question of how fast the disease could travel to the U.S.

A visualization of the network can be seen below. Each node is a country and each edge represents an existing airline route from one country to another. Flights that take off and land in the same country are omitted to avoid clutter.

Each vertex is a country and each edge represents an existing airline route between two countries. Flights beginning and ending in the same country are not represented, for clarity.

Communities and Homophily

I used a spinglass algorithm to detect “communities” of countries, i.e. sets of countries with many flights between themselves, but few flights between them and countries not in the set. Roughly speaking, the algorithm tended to group countries in the same continent together. However, this is not always the case. For example, France was placed in the same community as several African countries, due to its close relationships with its former colonies. Overall, this network seems to exhibit homophily – countries on the same continent tend to be connected more with each other than with countries off their continent.

Liberia, the US, and Degree Distribution

The labels are unclear in the plot, but the United States and Liberia are in two separate communities, which may lead us to believe that Ebola spreading from Liberia to the US would be unlikely. In fact, the degrees (the number of other countries a given country is connected to) of the two countries differ greatly, which would also support this intuition. The US is connected to 186 other countries, whereas Liberia is connected to only 12. The full degree distribution is shown below. It roughly follows a power law, which, according to Wikipedia, is what we should expect. Note that the approximation is asymptotic, which could be why this finite sample is off. According to the degree distribution, half of all countries are connected to 27 or fewer other countries. Liberia falls far below the median and the US falls far above it.

Degree distribution of airline connections and the power law. If a network’s degree distribution follows a power law, we say it is a “scale-free” network.

Small Worlds

Let’s zoom in and look at Liberia’s second-degree connections:

Liberia’s airline connections. Sierra Leone and Cote d’Ivoire have no direct connections to the US, so their connections are not shown.

Even though they’re in two different communities, Liberia and the US are only two degrees of separation away. This is generally the case for all countries: if, for each node, we calculated the shortest path between it and every other node, the average shortest distance would be about 2 (specifically 2.3) hops. This is called the small world phenomenon. On average, every country is 2 hops away from every other country. Many networks exhibit this phenomenon, largely due to “hubs” – countries (more generally, nodes) with lots of connections to other countries. For example, you can imagine that Charles de Gaulle airport in France is a hub which links the US with countries in Eastern Europe, Asia, and Africa. The existence of these hubs makes it possible to get from one country to another with very few transfers.

Contagion

The close-up network above shows that if Ebola were to spread to the US, it might be through Nigeria, Ghana, Morocco, and Belgium. If we knew the proportion of flights that went from Liberia to each of these countries, and from each of these countries to the US, we could estimate the probability of Ebola spreading along each route. I don’t have time to do this in full here, although it’s certainly possible given the data.
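
As a rough sketch of what that estimate could look like, using the country-to-country flight counts in the data frame a built in the code at the end of this post, and treating flight shares as crude transition probabilities:

lib <- a[a$fr=='Liberia',]
lib$p1 <- lib$flights/sum(lib$flights)           # share of Liberia's departures going to each country X

p_to_us <- sapply(as.character(lib$to), function(x){
  out <- a[a$fr==x,]                             # all flights leaving country X
  if(nrow(out)==0) return(0)
  sum(out$flights[out$to=='United States'])/sum(out$flights)   # share of X's departures going to the US
})

two_hop <- data.frame(via=lib$to, prob=lib$p1*p_to_us)
two_hop[order(-two_hop$prob),]                   # most likely two-hop paths from Liberia to the US
sum(two_hop$prob)                                # crude overall two-hop probability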

Of course, this is a huge simplification for many reasons. Even though Sierra Leone, for example, doesn’t have a direct connection to the US, it could be connected to other countries that are connected to the US, and such a route could still see a very high proportion of flights end up in the US. So we would need to account for this factor.

There are also several epidemiological parameters that could change how quickly the disease spreads. For example, the length of time from infection to symptoms is important. If the infected don’t show symptoms until a week after infection, then they can’t be screened and contained as easily. They could infect many others before showing symptoms.

The deadliness of the disease is also important. If patients die within hours of being infected, then the disease can’t spread very far. To take the extreme case, consider a patient who dies within a second of being infected: there is very little time for him to infect others.

Finally, we assumed a single origin. If the disease is already present in several countries when we run this analysis, then we would have to factor in multiple origins.

routes=read.table('.../routes.dat',sep=',')
ports=read.table('.../airports.dat',sep=',')

library(igraph)

# for each flight, get the country of the airport the plane took off from and landed at.
ports=ports[,c('V4','V5')]
names(ports)=c('country','airport')

routes=routes[,c('V3','V5')]
names(routes)=c('from','to')

m=merge(routes,ports,all.x=TRUE,by.x=c('from'),by.y=c('airport'))
names(m)[3]=c('from_c')
m=merge(m,ports,all.x=TRUE,by.x=c('to'),by.y=c('airport'))
names(m)[4]=c('to_c')

m$count=1
# create a unique country to country from/to route ID
m$id=paste(m$from_c,m$to_c,sep=',')

# see which routes are flown most frequently
a=aggregate(m$count,list(m$id),sum)
names(a)=c('id','flights')
a$fr=substr(a$id,1,regexpr(',',a$id)-1)
a$to=substr(a$id,regexpr(',',a$id)+1,100)
a=a[,2:4]

a$perc=(a$flights/sum(a$flights))*100

# create directed network graph
a=a[!(a[,2]==a[,3]),]
mat=as.matrix(a[,2:3])

g=graph.data.frame(mat, directed = T)

edges=get.edgelist(g)
deg=degree(g,directed=TRUE)
vv=V(g)

# use spinglass algo to detect community
set.seed(9)
sgc = spinglass.community(g)
V(g)$membership=sgc$membership
table(V(g)$membership)

V(g)[membership==1]$color = 'pink'
V(g)[membership==2]$color = 'darkblue'
V(g)[membership==3]$color = 'darkred'
V(g)[membership==4]$color = 'purple'
V(g)[membership==5]$color = 'darkgreen'

plot(g,
     main='Airline Routes Connecting Countries',
     vertex.size=5,
     edge.arrow.size=.1,
     edge.arrow.width=.1,
     vertex.label=ifelse(V(g)$name %in% c('Liberia','United States'),V(g)$name,''),
     vertex.label.color='black')
legend('bottomright',fill=c('darkgreen','darkblue', 'darkred', 'pink', 'purple'),
       c('Africa', 'Europe', 'Asia/Middle East', 'Kiribati, Marshall Islands, Nauru', 'Americas'),
       bty='n')

# plot degree distribution
dplot=degree.distribution(g,cumulative = TRUE)

plot(dplot,type='l',xlab='Degree',ylab='Frequency',main='Degree Distribution of Airline Network',lty=1)
lines((1:length(dplot))^(-.7),type='l',lty=2)
legend('topright',lty=c(1,2),c('Degree Distribution','Power Law with x^(-.7)'),bty='n')

# explore membership...regional patterns exist
cc=cbind(V(g)$name,V(g)$membership)
tt=cc[cc[,2]==5,]

# explore connections from Liberia to the United States
m=mat[mat[,1]=='Liberia',]

t=mat[mat[,1] %in% m[,2],]
tt=t[t[,2]=='United States',]

# assess probabilities

lib=a[a$fr=='Liberia',]
lib$prob=lib$flights/sum(lib$flights)
  # most probable route from liberia is Ghana

vec=c(tt[,1],'Liberia')
names(vec)=NULL

g2=graph.data.frame(mat[(mat[,1] %in% vec & mat[,2] == 'United States') | (mat[,1]=='Liberia'),], directed = T)
V(g2)$color=c('darkblue','darkgreen','darkgreen','darkgreen','darkgreen','purple','darkgreen','darkgreen')

plot(g2,
     main='Airline Connections from Liberia to the United States',
     vertex.size=5,
     edge.arrow.size=1,
     edge.arrow.width=.5,
     vertex.label.color='black')
legend('bottomright',fill=c('darkgreen','darkblue','purple'),
       c('Africa', 'Europe', 'Americas'),
       bty='n')


aa=a[a$fr %in% tt[,1],]
sum=aggregate(aa$flights,list(aa$fr),sum)

bb=a[a$fr %in% tt[,1] & a$to=='United States',]

fin=data.frame(bb$fr,sum$x,bb$flights,bb$flights/sum$x)

s=shortest.paths(g)
mean(s)

To leave a comment for the author, please follow the link and comment on his blog: Stable Markets » R.


Cluster analysis on earthquake data from USGS


(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)
Theoretical Background
In some cases we would like to classify the events we have in our dataset based on their spatial location or on some other data. As an example we can return to the epidemiological scenario in which we want to determine if the spread of a certain disease is affected by the presence of a particular source of pollution. With the G function we are able to determine quantitatively that our dataset is clustered, which means that the events are not driven by chance but by some external factor. Now we need to verify that there is indeed a cluster of points located around the source of pollution, and to do so we need a form of classification of the points.
Cluster analysis refers to a series of techniques that allow the subdivision of a dataset into subgroups, based on their similarities (James et al., 2013). There are various clustering methods, but probably the most common is k-means clustering. This technique aims at partitioning the data into a specific number of clusters, defined a priori by the user, by minimizing the within-cluster variation. The within-cluster variation measures how much each event in a cluster k differs from the others in the same cluster k. The most common way to compute the differences is using the squared Euclidean distance (James et al., 2013), calculated as follows:

W_k = (1/n_k) * Σ_{i,i' in k} Σ_{j=1..p} (x_ij - x_i'j)^2

Where W_k (I use the underscore to indicate the subscripts) is the within-cluster variation for the cluster k, n_k is the total number of elements in the cluster k, p is the total number of variables we are considering for clustering and x_ij is the value of variable j for event i in cluster k. This equation seems complex, but it is actually quite easy to understand. To better understand what it means in practice we can take a look at the figure below. 



For the sake of the argument we can assume that all the events in this point pattern are located in one unique cluster k, therefore n_k is 15. Since we are clustering events based on their geographical location we are working with two variables, i.e. latitude and longitude; so p is equal to two. To calculate the variance for one single pair of points in cluster k, we simply compute the difference between the first point’s value of the first variable, i.e. its latitude, and the second point’s value of the same variable; and we do the same for the second variable. So the variance between points 1 and 2 is calculated as follows:

V_(1:2) = (lat_1 - lat_2)^2 + (lon_1 - lon_2)^2

where V_(1:2) is the variance of the two points. Clearly the geographical position is not the only factor that can be used to partition events in a point pattern; for example we can divide earthquakes based on their magnitude. The two equations can therefore be adapted to take more variables, and the only difference is in the length of the sum that needs to be computed to calculate the variation between two points. The only problem may be in the number of computations that would be needed to obtain a solution; this however is something that the k-means algorithm solves very efficiently.
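
In R this pairwise calculation is just a sum of squared differences; a tiny sketch with made-up values, adding magnitude as a third variable to show how extra variables simply extend the sum:

p1 <- c(lat=37.75, lon=-122.45, mag=4.2)   # hypothetical event 1
p2 <- c(lat=38.10, lon=-121.90, mag=5.1)   # hypothetical event 2
sum((p1-p2)^2)                             # squared Euclidean distance between the two events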

The algorithm starts by randomly assigning each event to a cluster, then it calculates the mean centre of each cluster (we looked at what the mean centre is in the post: Introductory Point Pattern Analysis of Open Crime Data in London). At this point it calculates the Euclidean distance between each event and each cluster’s mean centre, reassigns every event to the cluster with the closest mean centre, recalculates the mean centres, and keeps going until the cluster memberships stop changing. As an example we can look at the figure below, assuming we want to divide the events into two clusters. 



In Step 1 the algorithm assigns each event to a cluster at random. It then computes the mean centres of the two clusters (Step 2), which are the large black and red circles. Then the algorithm calculates the Euclidean distance between each event and the two mean centres, and reassigns the events to new clusters based on the closest mean centre; so if a point was first in cluster one but is closer to the mean centre of cluster two, it is reassigned to the latter (Step 3). Subsequently the mean centres are computed again for the new clusters (Step 4). This process keeps going until the cluster elements stop changing.
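
The kmeans() function in base R runs exactly this assign-and-recompute loop for us; a toy sketch on made-up coordinates (the numbers are arbitrary):

set.seed(1)
pts <- data.frame(lon=runif(15,-5,5), lat=runif(15,40,50))   # 15 hypothetical events
km <- kmeans(pts, centers=2)    # the iterative reassignment happens inside kmeans()
km$centers                      # final mean centres of the two clusters
km$cluster                      # cluster membership of each event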



Practical Example
In this experiment we will look at a very simple exercise in cluster analysis of seismic events downloaded from the USGS website. To complete this exercise you would need the following packages: sp, raster, plotrix, rgeos, rgdal and scatterplot3d.
I already mentioned in the post Downloading and Visualizing Seismic Events from USGS how to download the open data from the United States Geological Survey, so I will not repeat the process. The code for that is the following.

URL <- "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"
Earthquake_30Days <- read.table(URL, sep = ",", header = T)
 
 
#Download, unzip and load the polygon shapefile with the countries' borders
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile="TM_WORLD_BORDERS_SIMPL-0.3.zip")
unzip("TM_WORLD_BORDERS_SIMPL-0.3.zip",exdir=getwd())
polygons <- shapefile("TM_WORLD_BORDERS_SIMPL-0.3.shp")

I also included the code to download the shapefile with the borders of all countries.

For the cluster analysis I would like to try to divide the seismic events by origin. In other words I would like to see if there is a way to distinguish between events close to plates, volcanoes or other faults. In many cases the distinction is hard to make, since many volcanoes originate from subduction, e.g. the Andes, where plates and volcanoes are close to one another and the algorithm may find it difficult to distinguish the origins. In any case I would like to explore the use of cluster analysis to see what the algorithm is able to do.

Clearly the first thing we need to do is download data regarding the location of plates, faults and volcanoes. We can find shapefiles with these information at the following website: http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/

The data are provided in zip files, so we need to extract them and load them in R. There are some legal restrictions on the use of these data: they are distributed by ESRI and can be used in conjunction with the book "Mapping Our World: GIS Lessons for Educators". Details of the license and other information may be found here: http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Earthquakes/plat_lin.htm#getacopy

If you have the rights to download and use these data for your studies you can download them directly from the web with the following code. We already looked at code to do this in previous posts so I would not go into details here:

dir.create(paste(getwd(),"/GeologicalData",sep=""))
 
#Faults
download.file("http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Zip/FAULTS.zip",destfile="GeologicalData/FAULTS.zip")
unzip("GeologicalData/FAULTS.zip",exdir="GeologicalData")
 
faults <- shapefile("GeologicalData/FAULTS.SHP")
 
 
#Plates
download.file("http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Zip/PLAT_LIN.zip",destfile="GeologicalData/plates.zip")
unzip("GeologicalData/plates.zip",exdir="GeologicalData")
 
plates <- shapefile("GeologicalData/PLAT_LIN.SHP")
 
 
#Volcano
download.file("http://legacy.jefferson.kctcs.edu/techcenter/gis%20data/World/Zip/VOLCANO.zip",destfile="GeologicalData/VOLCANO.zip")
unzip("GeologicalData/VOLCANO.zip",exdir="GeologicalData")
 
volcano <- shapefile("GeologicalData/VOLCANO.SHP")

The only piece of code that I never presented before is the first line, which creates a new folder. It is pretty self-explanatory: we just need to provide a string with the name of the folder and R will create it. The rest of the code downloads the data from the addresses above, unzips them and loads them into R.

We have not yet transformed the object Earthquake_30Days, which is now a data.frame, into a SpatialPointsDataFrame. The data from USGS contain seismic events that are not only earthquakes but also events related to mining and other causes. For this analysis we want to keep only the events that are classified as earthquakes, which we can do with the following code:

Earthquakes <- Earthquake_30Days[paste(Earthquake_30Days$type)=="earthquake",]
coordinates(Earthquakes)=~longitude+latitude

This extracts only the earthquakes and transforms the object into a SpatialObject.


We can create a map that shows the earthquakes alongside all the other geological elements we downloaded using the following code, which saves the image directly to a jpeg:

jpeg("Earthquake_Origin.jpg",4000,2000,res=300)
plot(plates,col="red")
plot(polygons,add=T)
title("Earthquakes in the last 30 days",cex.main=3)
lines(faults,col="dark grey")
points(Earthquakes,col="blue",cex=0.5,pch="+")
points(volcano,pch="*",cex=0.7,col="dark red")
legend.pos <- list(x=20.97727,y=-57.86364)
 
legend(legend.pos,legend=c("Plates","Faults","Volcanoes","Earthquakes"),pch=c("-","-","*","+"),col=c("red","dark grey","dark red","blue"),bty="n",bg=c("white"),y.intersp=0.75,title="Days from Today",cex=0.8)
 
text(legend.pos$x,legend.pos$y+2,"Legend:")
dev.off()

This code is very similar to what I used here so I will not explain it in detail. We just added more elements to the plot, and therefore we need to remember that R plots in layers, one on top of the other, depending on the order in which they appear in the code. For example, as you can see from the code, the first thing we plot are the plates, which will be plotted below everything else, even the borders of the polygons, which come second. You can change this just by changing the order of the lines; just remember to use the option add=T correctly.
The result is the image below:


Before proceeding with the cluster analysis we first need to fix the projections of the SpatialObjects. Luckily the object polygons was created from a shapefile with the projection data attached to it, so we can use it to tell R that the other objects have the same projection:

projection(faults)=projection(polygons)
projection(volcano)=projection(polygons)
projection(Earthquakes)=projection(polygons)
projection(plates)=projection(polygons)

Now we can proceed with the cluster analysis. As I said, I would like to try to classify earthquakes based on their distance from the various geological features. To calculate this distance we can use the function gDistance in the package rgeos.
These shapefiles are all unprojected, and their coordinates are in degrees. We cannot use them directly with the function gDistance because it deals only with projected data, so we need to transform them using the function spTransform (in the package rgdal). This function takes two arguments, the first is the SpatialObject, which needs to have projection information, and the second is the data regarding the projection to transform the object into. The code for doing that is the following:

volcanoUTM <- spTransform(volcano,CRS("+init=epsg:3395"))
faultsUTM <- spTransform(faults,CRS("+init=epsg:3395"))
EarthquakesUTM <- spTransform(Earthquakes,CRS("+init=epsg:3395"))
platesUTM <- spTransform(plates,CRS("+init=epsg:3395"))

The projection we are going to use is the standard World Mercator (EPSG:3395), details here: http://spatialreference.org/ref/epsg/wgs-84-world-mercator/

NOTE:
the plates object presents lines also along the borders of the image above. This is something that R cannot deal with, so I had to remove them manually from ArcGIS. If you want to replicate this experiment you have to do the same. I do not know of any method in R to do that quickly, if you know it please let me know in the comment section.


We are going to create a matrix of distances between each earthquake and the geological features with the following loop:

distance.matrix <- matrix(0,nrow(Earthquakes),7,dimnames=list(c(),c("Lat","Lon","Mag","Depth","DistV","DistF","DistP")))
for(i in 1:nrow(EarthquakesUTM)){
sub <- EarthquakesUTM[i,]
dist.v <- gDistance(sub,volcanoUTM)
dist.f <- gDistance(sub,faultsUTM)
dist.p <- gDistance(sub,platesUTM)
distance.matrix[i,] <- matrix(c(sub@coords,sub$mag,sub$depth,dist.v,dist.f,dist.p),ncol=7)
}
 
 
distDF <- as.data.frame(distance.matrix)


In this code we first create an empty matrix, which is usually wise to do, since R allocates the RAM for the whole object in advance and filling it should also be faster than growing a new matrix from inside the loop. In the loop we iterate through the earthquakes and for each one we calculate its distance to the geological features. Finally we convert the matrix into a data.frame.

The next step is finding the correct number of clusters. To do that we can follow the approach suggested by Matthew Peeples here: http://www.mattpeeples.net/kmeans.html and also discussed in this stackoverflow post: http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters

The code for that is the following:

mydata <-  scale(distDF[,5:7])
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,
centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares")

We basically compute the clustering for between 2 and 15 clusters and plot the number of clusters against the "within clusters sum of squares", which is the quantity that is minimized during the clustering process. Generally this quantity decreases very fast up to a point and then basically stops decreasing. We can see this behaviour in the plot below, generated from the earthquake data:


As you can see, for 1 and 2 clusters the sum of squares is high and decreases fast; it keeps decreasing between 3 and 5 clusters, and after that it becomes erratic. So probably the best number of clusters would be 5, but clearly this is an empirical method, so we would need to check other numbers and test whether they make more sense.

To create the clusters we can simply use the function kmeans, which takes two arguments: the data and the number of clusters:

clust <- kmeans(mydata,5)
distDF$Clusters <- clust$cluster

We can check the physical meaning of the clusters by plotting them against the distance from the geological features using the function scatterplot3d, in the package scatterplot3d:

scatterplot3d(distDF$DistV,xlab="Distance to Volcano",distDF$DistF,ylab="Distance to Fault",distDF$DistP,zlab="Distance to Plate", color = clust$cluster,pch=16,angle=120,scale=0.5,grid=T,box=F)

This function is very similar to the standard plot function, but it takes three variables instead of just two. I wrote the line of code distinguishing between the three axes to make it easier to read: we have the variable for x and the corresponding axis label, and so on for each axis. Then we set the colours based on the clusters, and the symbol with pch, as we would do in plot. The last options are only available here: we set the angle between the x and y axes, the scale of the z axis compared to the other two, and then we plot a grid on the xy plane and do not plot a box all around the plot. The result is the following image:



It seems that the red and green clusters are very similar; they differ only in that the red one is closer to volcanoes than to faults, and vice versa for the green one. The black cluster seems only to be farther away from volcanoes. Finally, the blue and light blue clusters seem to be close to volcanoes and far away from the other two features.

We can create an image with the clusters using the following code:

clustSP <- SpatialPointsDataFrame(coords=Earthquakes@coords,data=data.frame(Clusters=clust$cluster))
 
jpeg("Earthquake_Clusters.jpg",4000,2000,res=300)
plot(plates,col="red")
plot(polygons,add=T)
title("Earthquakes in the last 30 days",cex.main=3)
lines(faults,col="dark grey")
points(volcano,pch="x",cex=0.5,col="yellow")
legend.pos <- list(x=20.97727,y=-57.86364)
 
points(clustSP,col=clustSP$Clusters,cex=0.5,pch="+")
legend(legend.pos,legend=c("Plates","Faults","Volcanoes","Earthquakes"),pch=c("-","-","x","+"),col=c("red","dark grey","dark red","blue"),bty="n",bg=c("white"),y.intersp=0.75,title="Days from Today",cex=0.6)
 
text(legend.pos$x,legend.pos$y+2,"Legend:")
 
dev.off()

I created the object clustSP based on the coordinates in WGS84 so that I can plot everything as before. I also plotted the volcanoes in yellow, so that they differ from the red cluster. The result is the following image:



To conclude this experiment I would also like to explore the relation between the distance to the geological features and the magnitude of the earthquakes. To do that we need to identify the events that are at a certain distance from each geological feature. We can use the function gBuffer, again available from the package rgeos, for this job.

volcano.buffer <- gBuffer(volcanoUTM,width=1000)
volcano.over <- over(EarthquakesUTM,volcano.buffer)
 
plates.buffer <- gBuffer(platesUTM,width=1000)
plates.over <- over(EarthquakesUTM,plates.buffer)
 
faults.buffer <- gBuffer(faultsUTM,width=1000)
faults.over <- over(EarthquakesUTM,faults.buffer)

This function takes a minimum of two arguments: the SpatialObject and the maximum distance to reach with the buffer, given by the option width (in metres, because the function requires projected data). The result is a SpatialPolygons object that includes a buffer around the starting features; for example, if we start with a point we end up with a circle of radius equal to width. In the code above we first created these buffer areas and then overlaid EarthquakesUTM on them to find the events located within their borders. The over() function returns one of two values: NA if the object is outside the buffer area and 1 if it is inside. We can use this information to subset EarthquakesUTM later on.
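As a quick aside (this is not part of the original script), the NA/1 output of over() can be used straight away to subset the events that fall inside a buffer; for instance, for the volcano buffer:

near.volcanoes <- EarthquakesUTM[!is.na(volcano.over),]  #keep only events within 1 km of a volcano
nrow(near.volcanoes)                                     #how many such events there are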

Now we can include the overlays in EarthquakesUTM as follows:

EarthquakesUTM$volcano <- as.numeric(volcano.over)
EarthquakesUTM$plates <- as.numeric(plates.over)
EarthquakesUTM$faults <- as.numeric(faults.over)

To determine if there is a relation between the distance from each feature and the magnitude of the earthquakes we can simply plot the magnitude's distribution for the various events included in the buffer areas we created before with the following code:

plot(density(EarthquakesUTM[paste(EarthquakesUTM$volcano)=="1",]$mag),ylim=c(0,2),xlim=c(0,10),main="Earthquakes by Origin",xlab="Magnitude")
lines(density(EarthquakesUTM[paste(EarthquakesUTM$faults)=="1",]$mag),col="red")
lines(density(EarthquakesUTM[paste(EarthquakesUTM$plates)=="1",]$mag),col="blue")
legend(3,0.6,title="Mean magnitude per origin",legend=c(paste("Volcanic",round(mean(EarthquakesUTM[paste(EarthquakesUTM$volcano)=="1",]$mag),2)),paste("Faults",round(mean(EarthquakesUTM[paste(EarthquakesUTM$faults)=="1",]$mag),2)),paste("Plates",round(mean(EarthquakesUTM[paste(EarthquakesUTM$plates)=="1",]$mag),2))),pch="-",col=c("black","red","blue"),cex=0.8)

which creates the following plot:


It seems that earthquakes close to plates have higher magnitude on average.








IPython Markdown Opportunities in IPython Notebooks and Rstudio


(This article was first published on OUseful.Info, the blog... » Rstats, and kindly contributed to R-bloggers)

One of the reasons I started working on the Wrangling F1 Data With R book was to see what the Rmd (RMarkdown) workflow was like. Rmd allows you to combine markdown and R code in the same document, as well as executing the code blocks and then displaying the results of that code execution inline in the output document.

[Figure: rmd_demo]

As well as rendering to HTML, we can generate markdown (md is actually produced as the interim step to HTML creation), PDF output documents, etc etc.

One thing I’d love to be able to do in the RStudio/RMarkdown environment is include – and execute – Python code. Does a web search to see what Python support there is in R… Ah, it seems it does it already… (how did I miss that?!)

[Figure: knitr_py]

ADDED: Unfortunately, it seems as if Python state is not persisted between separate python chunks – instead, each chunk is run as a one-off inline Python command. However, it seems as if there could be a way round this, which is to use a persistent IPython session; and the knitron package looks like just the thing for supporting that.

So that means in RStudio, I could use knitr and Rmd to write a version of Wrangling F1 Data With RPython

Of course, it would be nicer if I could write such a book in an everyday python environment – such as in an IPython notebook – that could also execute R code (just to be fair;-)

I know that we can already use cell magic to run R in a IPython notebook:

[Figure: ipynb_rmagic]

…so that’s that part of the equation.

And the notebooks do already allow us to mix markdown cells and code blocks/output. The default notebook presentation style is to show the code cells with the numbered In []: and Out []: block numbering, but it presumably only takes a small style extension or customisation to suppress that? And another small extension to add the ability to hide a code cell and just display the output?

So what is it that (to my mind at least) makes RStudio a nicer writing environment? One reason is the ability to write the Rmarkdown simply as Rmarkdown in a simple text editor environment. Another is the ability to inline R code and display its output in-place.

Taking that second point first, the ability to do better inlining in IPython notebooks – it looks like this is just what the python-markdown extension seems to do:

[Figure: python_markdown]

But how about the ability to write some sort of pythonMarkdown and then open in a notebook? Something like ipymd, perhaps…?

[Figure: rossant_ipymd]

What this seems to do is allow you to open an IPython-markdown document as an IPython notebook (in other words, it replaces the ipynb JSON document with an ipymd markdown document…). To support the document creation aspects better, we just need an exporter that removes the code block numbering and trivially allows code cells to be marked as hidden.

Now I wonder… what would it take to be able to open an Rmd document as an IPython notebook? Presumably just the ability to detect the code language, and then import the necessary magics to handle its execution? It’d be nice if it could cope with inline code, e.g. using the python-markdown magic too?

Exciting times could be ahead:-)



Worrying About my Cholesterol Level


(This article was first published on Econometrics Beat: Dave Giles' Blog, and kindly contributed to R-bloggers)
The headline, "Don't Get Wrong Idea About Cholesterol", caught my attention in the 3 May, 2015 Times-Colonist newspaper here in Victoria, B.C.. In fact the article came from a syndicated column, published about a week earlier. No matter - it's always a good time for me to worry about my cholesterol!

The piece was written by a certain Dr. Gifford-Jones (AKA Dr. Ken Walker).

Here's part of what he had to say:

"Years ago, Dr. John Judkin, formerly emeritus professor of physiology at the University of London, was ridiculed after after he reported that a high dietary intake of animal fat and the eating of foods containing cholesterol were not the cause of coronary heart disease. 
But Judkin pointed to a greater correlation between the intake of sucrose (ordinary sugar) and coronary attack. For instance a study in 15 countries showed that as the population consumed more sugar, there was a dramatic increase in heart attacks. 
More impressive is a prison study by Milton Winitz, a U.S. biochemist, in 1964. Eighteen prisoners, kept behind bars for six months, were given food that was regulated. In this controlled  environment, it was proven that when the prisoner diet was high in sugar, blood cholesterol increased and when dietary sugar was decreased there was a huge drop in blood cholesterol."
I've got nothing against the good doctor, but you'll notice that I've highlighted a few key words in the material quoted above. I'm sure I don't need to explain why!

What he's referring to is research reported by Winitz and his colleagues in the 1964 paper, "The effect of dietary carbohydrate on serum cholesterol levels" (Archives of Biochemistry and Biophysics, 108, 576-579). Interestingly, the findings outlined in that paper were a by-product of the main research that was being undertaken with NASA sponsorship - research into the development of diets for astronauts!

In his famous book, How to Live Longer and Feel Better, the Nobel laureate Linus Pauling refers to this study by Winitz et al.:
"These investigators studied 18 subjects, who were kept in a locked institution, without access to other food, during the whole period of study (about 6 months). 


After a preliminary period with ordinary food, they were placed on a chemically well-defined small molecule diet (seventeen amino acids, a little fat, vitamins, essential minerals, and glucose as the only carbohydrate).

The only significant physiological change that was found was in the concentration of cholesterol in the blood serum, which decreased rapidly for each of the 18 subjects.


The average concentration in the initial period, on ordinary food, was 227 milligrams per deciliter. After two weeks on the glucose diet it had dropped to 173, and after another two weeks it was 160.


The diet was then changed by replacing one quarter of the glucose with sucrose, with all the other dietary constituents the same. Within one week the average cholesterol concentration had risen from 160 to 178, and after two more weeks to 208.

The sucrose was then replaced by glucose. Within one week the average cholesterol concentration had dropped to 175, and it continued dropping , leveling off at 150, 77 less than the initial value." (p.42)

Does any of this constitute proof? Of course not!

But let's take a look at the actual data and undertake our own statistical analysis of this very meagre set of information. From Winitz et al., p.577, we have:


(The data are available in a text file on the data page for this blog. As you can see, the sample size is extremely small - only 18 people were "treated".)

The authors summarize their key findings as follows (p. 578):
"On the basis of a statistical analysis of the mean values of the serum cholesterol levels at the end of the 4th, 7th, 8th, and 19th weeks (see Table I), the following conclusions were drawn (95% confidence level): (a) each of the two progressive decreases in serum cholesterol level with the diet containing glucose as the sole sugar is statistically significant, and (b) the progressive increase in serum cholesterol level upon partial substitution of the glucose with sucrose is also statistically significant."  
Interestingly, exactly what they mean by, "a statistical analysis", is not explained!

(Don't try getting away with that lack of specificity, kids!)

So, the claim is that there's a significant difference between the "before treatment" and "after treatment" results. I must admit to being somewhat skeptical about this, given the tiny number of candidates being treated. However, let's keep an open mind.

Crucially, we're not told by the researchers what statistical tests were actually performed!

However, given that we have a group of people with "before treatment" and "after treatment" scenarios, paired t-tests for the equality of the means provide a natural way to proceed. (e.g., see here.) This requires that we have simple random sampling, and that the population is Normal. More specifically, the differences between the "before" and "after" data have to be normal.
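In R, such a test is just a call to t.test() with the option paired=TRUE. Here is a minimal sketch, with a hypothetical file name and hypothetical column names, since the actual data are on the blog's data page:

cholesterol <- read.table("winitz_cholesterol.txt", header=TRUE)  #hypothetical file name
t.test(cholesterol$Week0, cholesterol$Week2, paired=TRUE)         #hypothetical column names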

Just for funzies, let's use a variety of resources in our statistical analysis.

My EViews workfile can be found on the code page for this blog. Take a look at the "README" text object in that file for some details of what I did.

Some basic Q-Q plots support the normality assumption for the data differences. Here's just one typical example - it's for the difference ("D67") between the "Week 7" and "Week 6" data in Table 1, above:
Moreover, the Anderson-Darling test for normality (with a sample size adjustment) produces p-values ranging from 0.219 to 0.925. This is very strong support for the normality assumption, especially given the known desirable power properties of this test.
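A rough R equivalent of these checks, again using the hypothetical column names introduced above, would be a Q-Q plot of the differences plus the ad.test() function from the nortest package:

d67 <- cholesterol$Week7 - cholesterol$Week6  #differences between Week 7 and Week 6
qqnorm(d67); qqline(d67)                      #Q-Q plot of the differences
library(nortest)
ad.test(d67)                                  #Anderson-Darling test for normality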

The assumption of random sampling is somewhat more problematic. Were the 18 inmates selected randomly from the population at large? I don't think so! There's nothing we can do about this, except to be cautious if we try to extrapolate the results of the study to a more general population. Which is, of course, what Dr. Gifford-Jones and others are doing.

Now, what about the results for the paired t-tests themselves?

Here they are, with p-values in parentheses. The naming of the t-statistics and their p-values follows the column headings in Table 1, above, with "1½" and "2½" abbreviated to "1" and "2" respectively. For example, "t02" and "p02" refer to the test between the "0 Weeks" and "2 Weeks" data. Similarly, "t819" and "p819" refer to the test between the "8 Weeks" and "19 Weeks" data, etc.

Table 2 (Phase I: t01–t24; Phase II: t56–t67; Phase III: t819)

             t01      t02      t04      t12      t14      t24      t56      t57      t67     t819
t-stat     -7.92    -7.33    -9.82    -1.10    -2.88    -2.21     4.92     7.54     3.71    -4.12
p-value   (0.00)   (0.00)   (0.00)   (0.15)   (0.01)   (0.02)   (0.00)   (0.00)   (0.00)   (0.00)

Take a look back at footnote "b" in Table 1 above. You'll see that negative t-statistics are expected everywhere in Table 2 except during "Phase II" of the trials, if we believe the hypothesis that (prolonged) high sucrose intakes are associated with high cholesterol levels.

In all but one instance, the paired t-tests give results that are significant at the 5% level.

Now, it's all very well to have obtained these results, but we might ask - "How powerful is the paired t-test when we're working with such small samples?"

To answer this question I decided to adapt some code that uses the "pwr" package in R, kindly provided by Jake Westfall on the Cross Validated site. The code requires the value of the so-called "effect size", which I computed for our data to be equal to 1, using this online resource. The "tweaked" R code that I used is available on my code page.

For a particular sample correlation between the paired data, the code computes the (minimum) number of pairs needed for a paired t-test to have a desired power when the significance level is 5%. The results appear in Table 3 below.
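The core of that calculation can be sketched directly with pwr.t.test(); this is only a rough outline, not Jake Westfall's actual code, using the Weeks 1 and 2 correlation discussed below:

library(pwr)
d <- 1                        #effect size computed with the online resource above
r <- 0.898                    #sample correlation between the paired columns (Weeks 1 and 2)
dz <- d/sqrt(2*(1-r))         #effect size of the paired differences
pwr.t.test(d=dz, sig.level=0.05, power=0.99, type="paired")  #solves for the minimum number of pairs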
Table 3


The pair-wise sample correlations in the data set we're examining (the relevant columns in Table 1) range between 0.696 and 0.964. So, in Table 3, it turns out that even for the sample sizes that we have, the powers of the paired t-tests are actually quite respectable. For example, the sample correlation for the data for Weeks 1 and 2 is 0.898, so a sample size of at least 5 is needed for the test of equality of the corresponding means to have a power of 99%. This is for a significance level of 5%. This minimum sample size increases to 6 if the significance level is 1% - you can re-run the R code to verify this.

At the end of the day, the small number of people included in the experiment was probably not a big problem. However, don't forget that (questionable) assumption of independent sampling.

In any case, I'm going to cut back on my sugar intake and get more exercise!


© 2015, David E. Giles


15 Easy Solutions To Your Data Frame Problems In R


(This article was first published on The DataCamp Blog » R, and kindly contributed to R-bloggers)

R’s data frames regularly create somewhat of a furor on public forums like Stack Overflow and Reddit. Starting R users often experience problems with the data frame in R and it doesn’t always seem to be straightforward. But does it really need to be so?

Well, not necessarily.

With today’s post, DataCamp wants to show you that data frames don’t need to be hard: we offer you 15 easy, straightforward solutions to the most frequently occurring problems with data.frame. These issues have been selected from the most recent and sticky or upvoted Stack Overflow posts. If, however, you are more interested in getting an elaborate introduction to data frames, you might consider taking a look at our Introduction to R course.

The Root: What’s A Data Frame?

R’s data frames offer you a great first step by allowing you to store your data in overviewable, rectangular grids. Each row of these grids corresponds to measurements or values of an instance, while each column is a vector containing data for a specific variable.

This means that the values within a single row do not need to be, but can be, of the same type: they can be numeric, character, logical, etc. As you can see in the data frame below, each instance, listed in the first unnamed column with a number, has certain characteristics that are spread out over the remaining three columns. Each column, on the other hand, needs to consist of values of the same type, since columns are data vectors: as such, the breaks column only contains numerical values, while the wool and tension columns have characters as values that are stored as factors.

In case you’re wondering, this data frame lists the number of breaks in yarn during weaving.

Remember that factors are variables that can only contain a limited number of different values. As such, they are often called categorical variables.

head(warpbreaks)
##   breaks wool tension
## 1     26    A       L
## 2     30    A       L
## 3     54    A       L
## 4     25    A       L
## 5     70    A       L
## 6     52    A       L

Maybe you will have already noticed that data frames resemble matrices, except for the fact that their data values don't need to be of the same type, while matrices do require this. Data frames also have similarities with lists, which are basically collections of components. A data frame, however, is a list of vector structures of the same length. As such, data frames can actually be seen as special types of lists and can be accessed as either a matrix or a list.
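To make this concrete, here is a tiny sketch with the built-in warpbreaks data: list-style and matrix-style access give you the same values.

warpbreaks[["wool"]][1:3]  #list-style: extract the wool column, then take its first three values
warpbreaks[1:3, "wool"]    #matrix-style: rows 1 to 3 of the wool column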

If you want more information or if you just want to review and take a look at a comparison of the five general data structures in R, watch the small video below:

[Video screenshot]

As you can see, there are different data structures that impose different requirements on how the data is stored. Data frames are handy to store multiple data vectors, which makes it easier to organize your data, to apply functions to it and to save your work. It's almost like having a single spreadsheet with columns that all have equal lengths!

The Basics Of Data Frames: The Questions And Solutions

How To Create A Simple Data Frame in R

Even though looking at built-in data frames such as esoph is interesting, it can easily get more exciting!

How?

By making your own data frame in R, of course! You can do this very easily by making some vectors first:

Died.At <- c(22,40,72,41)
Writer.At <- c(16, 18, 36, 36)
First.Name <- c("John", "Edgar", "Walt", "Jane")
Second.Name <- c("Doe", "Poe", "Whitman", "Austen")
Sex <- c("MALE", "MALE", "MALE", "FEMALE")
Date.Of.Death <- c("2015-05-10", "1849-10-07", "1892-03-26","1817-07-18")

Next, you just combine the vectors that you made with the data.frame() function:

writers_df <- data.frame(Died.At, Writer.At, First.Name, Second.Name, Sex, Date.Of.Death)

Remember that data frames must have variables of the same length. Check if you have put an equal number of arguments in all c() functions that you assign to the vectors and that you have indicated strings of words with "".

Note that when you use the data.frame() function, character variables are imported as factors or categorical variables. Use the str() function to get to know more about your data frame.

str(writers_df)
## 'data.frame':    4 obs. of  6 variables:
##  $ Died.At      : num  22 40 72 41
##  $ Writer.At    : num  16 18 36 36
##  $ First.Name   : Factor w/ 4 levels "Edgar","Jane",..: 3 1 4 2
##  $ Second.Name  : Factor w/ 4 levels "Austen","Doe",..: 2 3 4 1
##  $ Sex          : Factor w/ 2 levels "FEMALE","MALE": 2 2 2 1
##  $ Date.Of.Death: Factor w/ 4 levels "1817-07-18","1849-10-07",..: 4 2 3 1

Note that if you’re more interested in inspecting the first and the last lines of your data frame, you can use the head() and tail() functions, respectively.
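Both functions take your data frame and, optionally, the number of lines that you want to see:

head(writers_df, 2)  #first two rows of the data frame
tail(writers_df, 2)  #last two rows of the data frame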

You see that the First.Name, Second.Name, Sex and Date.Of.Death variables of the writers_df data frame have all been read in as factors. But do you want this?

  • For the variables First.Name and Second.Name, you don’t want this. You can use the I() function to insulate them. This function inhibits the interpretation of its arguments. In other words, by just slightly changing the definitions of the vectors First.Name and Second.Name with the addition of the I() function, you can make sure that the proper names are not interpreted as factors.
  • You can keep the Sex vector as a factor, because there are only a limited amount of possible values that this variable can have.
  • Also for the variable Date.of.Death you don’t want to have a factor. It would be better if the values are registered as dates. You can add the as.Date() function to this variable to make sure this happens.
Died.At <- c(22,40,72,41)
Writer.At <- c(16, 18, 36, 36)
First.Name <- I(c("John", "Edgar", "Walt", "Jane"))
Second.Name <- I(c("Doe", "Poe", "Whitman", "Austen"))
Sex <- c("MALE", "MALE", "MALE", "FEMALE")
Date.Of.Death <- as.Date(c("2015-05-10", "1849-10-07", "1892-03-26","1817-07-18"))
writers_df <- data.frame(Died.At, Writer.At, First.Name, Second.Name, Sex, Date.Of.Death)
str(writers_df)
## 'data.frame':    4 obs. of  6 variables:
##  $ Died.At      : num  22 40 72 41
##  $ Writer.At    : num  16 18 36 36
##  $ First.Name   :Class 'AsIs'  chr [1:4] "John" "Edgar" "Walt" "Jane"
##  $ Second.Name  :Class 'AsIs'  chr [1:4] "Doe" "Poe" "Whitman" "Austen"
##  $ Sex          : Factor w/ 2 levels "FEMALE","MALE": 2 2 2 1
##  $ Date.Of.Death: Date, format: "2015-05-10" "1849-10-07" ...

If you use other functions such as read.table() or other functions that are used to input data, such as read.csv() and read.delim(), a data frame is returned as the result. This way, files that look like this one below or files that have other delimiters, will be converted to data frames once they are read into R with these functions.

22, 16, John, Doe, MALE, 2015-05-10
40, 18, Edgar, Poe, MALE, 1849-10-07
72, 36, Walt, Whitman, MALE, 1892-03-26
41, 36, Jane, Austen, FEMALE, 1817-07-18
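As a minimal sketch, assuming the four lines above were saved in a file called writers.txt (a made-up name), you could read them in and name the columns in one go:

writers_from_file <- read.csv("writers.txt", header=FALSE, strip.white=TRUE,
                              col.names=c("Died.At", "Writer.At", "First.Name",
                                          "Second.Name", "Sex", "Date.Of.Death"))
str(writers_from_file)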

If you want to know more about how you can read and import Excel files into R, make sure to check out our tutorial! Alternatively, you could check out the Rdocumentation page on read.table.

How To Change A Data Frame’s Row And Column Names

Data frames can also have a names attribute, by which you can see the names of the variables that you have included into your data frame. In other words, you can also set the header for your data frame. You already did this before when making the data frame object writers_df; You see that the names of the variables Died.At, Writer.At, First.Name, Second.Name, Sex and Date.Of.Death appear:

writers_df
##   Died.At Writer.At First.Name Second.Name    Sex Date.Of.Death
## 1      22        16       John         Doe   MALE    2015-05-10
## 2      40        18      Edgar         Poe   MALE    1849-10-07
## 3      72        36       Walt     Whitman   MALE    1892-03-26
## 4      41        36       Jane      Austen FEMALE    1817-07-18

You can also retrieve the names with the names() function:

names(writers_df)
## [1] "Died.At"       "Writer.At"     "First.Name"    "Second.Name"  
## [5] "Sex"           "Date.Of.Death"

Now that you see the names of your data frame, you’re not so sure if these are efficient or correct. To change the names that appear, you can simply continue using the names() function. Make sure, though, that the number of arguments you put in the c() function is equal to the number of variables in your data frame. In this case, since there are six variables Died.At, Writer.At, First.Name, Second.Name, Sex and Date.Of.Death, you want six arguments in the c() function. Otherwise, the remaining variable names will be set to NA.

Note also how the arguments of the c() function are inputted as strings!

names(writers_df) <- c("Age.At.Death", "Age.As.Writer", "Name", "Surname", "Gender", "Death")
names(writers_df)
## [1] "Age.At.Death"  "Age.As.Writer" "Name"          "Surname"      
## [5] "Gender"        "Death"

Tip: try to leave out the two last arguments from the c() function and see what happens!

Note that you can also access and change the column and row names of your data frame with the functions colnames() and rownames(), respectively:

colnames(writers_df) = c("Age.At.Death", "Age.As.Writer", "Name", "Surname", "Gender", "Death")
rownames(writers_df) = c("ID1", "ID2", "ID3", "ID4")

How To Check A Data Frame’s Dimensions

As you know, the data frame is similar to a matrix, which means that its size is determined by how many rows and columns you have combined into it. To check how many rows and columns you have in your data frame, you can use the dim() function:

dim(writers_df)
## [1] 4 6

The result of this function is represented as [1] 4 6. Just like a matrix, the data frame’s dimensions are defined by the number of rows, followed by the number of columns. If you are in doubt, you can check your numbers through a comparison with the original data frame!

Note that you can also just retrieve the number of rows or columns by entering

dim(writers_df)[1] #Number of rows
dim(writers_df)[2] #Number of columns

or by using the functions nrow() and ncol(), to retrieve the number of rows or columns, respectively:

nrow(writers_df) 
ncol(writers_df)

Since the data frame structure is also similar to a list, you could also use the length() function. Keep in mind, though, that a data frame is a list of its columns, so length() returns the number of columns, just like ncol(), and not the number of rows:

length(writers_df) #Same result as ncol(writers_df)

How To Access And Change A Data Frame’s Values

…. Through The Variable Names

Now that we have retrieved and set the names of our data frame, we want to take a closer look at the values that are actually stored in it. There are two straightforward ways that you can access these values. First, you can try to access them by just entering the data frame’s name in combination with the variable name:

writers_df$Age.As.Writer

Note that if you change one of the values in the original vector Writer.At, this change will not be incorporated into the data frame:

Writer.At[1]=2
writers_df
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 1           22            16  John     Doe   MALE 2015-05-10
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 3           72            36  Walt Whitman   MALE 1892-03-26
## 4           41            36  Jane  Austen FEMALE 1817-07-18

In the end, the data frame stores its own copy of each vector that you used to create it, and accessing a column in this way just gives you a copy of that variable. That’s why changes to the original vectors do not change the data frame’s variables.

… Through The [,] and $ Notations

You can also access the data frame’s values by using the [,] notation:

writers_df[1:2,3] #Values located in the first and second rows, third column

gives

## [1] "John"  "Edgar"
writers_df[, 3] #Values located in the third column

gives

## [1] "John"  "Edgar" "Walt"  "Jane"
writers_df[3,] #Values located in the third row

gives

##   Age.At.Death Age.As.Writer Name Surname Gender      Death
## 3           72            36 Walt Whitman   MALE 1892-03-26

Remember that data frames’ dimensions are defined as rows by columns.

An alternative to the [,] notation is a notation with $:

writers_df$Age.At.Death

gives

## [1] 22 40 72 41
writers_df$Age.At.Death[3] #Value located on third row of the column `Age.At.Death`

gives

## [1] 72

Note that you can also change the values of your data frame by simply using these notations to perform mathematical operations:

writers_df$Age.At.Death <- writers_df$Age.At.Death-1
writers_df[,1] <- writers_df$Age.At.Death-1

If you really want to make your hands dirty some more and change some of the data frame’s values, you can use the [,] notation to actually change the values inside your data frame one by one:

writers_df[1,3] = "Jane"
writers_df[1,5] = "FEMALE"
writers_df
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 1           22            16  Jane     Doe FEMALE 2015-05-10
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 3           72            36  Walt Whitman   MALE 1892-03-26
## 4           41            36  Jane  Austen FEMALE 1817-07-18

Why And How To Attach Data Frames

The $ notation is pretty handy, but it can become very annoying when you have to type it each time that you want to work with your data. The attach() function offers a solution to this: it takes a data frame as an argument and places it in the search path at position 2. So unless there are objects in position 1 (your global environment) with exactly the same names as the columns of the data frame you have inputted, you can call the data frame's variables directly by name.

Note that the search path is in fact the order in which R accesses files. You can look this up by entering the search() function:

search()
##  [1] ".GlobalEnv"         "package:knitr"      "package:RWordPress"
##  [4] "package:REmails"    "package:RJSONIO"    "package:httr"      
##  [7] "writers_df"         "env:itools"         "package:data.table"
## [10] "package:RDatabases" "package:RMySQL"     "package:DBI"       
## [13] "package:yaml"       "package:dplyr"      "tools:rstudio"     
## [16] "package:stats"      "package:graphics"   "package:grDevices" 
## [19] "package:utils"      "package:datasets"   "package:methods"   
## [22] "Autoloads"          "package:base"
attach(writers_df)
## The following objects are masked _by_ .GlobalEnv:
## 
##     Age.As.Writer, Age.At.Death, Death, Gender, Name, Surname
writers_df
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 1           22            16  Jane     Doe FEMALE 2015-05-10
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 3           72            36  Walt Whitman   MALE 1892-03-26
## 4           41            36  Jane  Austen FEMALE 1817-07-18

Note that, as an alternative to attaching, you can use the with() function, which evaluates whatever expression you pass as its second argument inside the data frame, without changing the search path:

with(writers_df, c("Age.At.Death", "Age.As.Writer", "Name", "Surname", "Gender", "Death"))
## [1] "Age.At.Death"  "Age.As.Writer" "Name"          "Surname"      
## [5] "Gender"        "Death"
writers_df
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 1           22            16  Jane     Doe FEMALE 2015-05-10
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 3           72            36  Walt Whitman   MALE 1892-03-26
## 4           41            36  Jane  Austen FEMALE 1817-07-18
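The call above only echoes the column names; a more typical use of with() is to compute directly on a column without attaching anything, for example:

with(writers_df, mean(Age.At.Death))  #evaluate mean(Age.At.Death) inside writers_df
## [1] 43.75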

You can now safely execute the following command and you can actually access/change the values of all the data frame’s variables:

Age.At.Death
## [1] 22 40 72 41
Age.At.Death <- Age.At.Death-1
Age.At.Death

If you get a message telling you that “The following objects are masked by .GlobalEnv:”, this is because you have objects in your global environment with the same names as your data frame’s columns. Those objects could be the vectors that you created above, if you didn’t change their names. You have two solutions to this:

  1. You just don’t create any objects with those names in your global environment. This is more a solution for those of you who imported their data through read.table(), read.csv() or read.delim(), but not really appropriate for this case.
  2. You rename the objects in the data frame so that there’s no conflict. This is the solution that was applied in this tutorial. So, rename your columns with the names() or colnames() functions.

Note that if all else fails, you can just remember to always refer to your data frame’s column names with the $ notation!

How To Apply Functions To Data Frames

Now that you have successfully made and modified your data frame by putting a header in place, you can start applying functions to it! In some cases where you want to calculate stuff, you might want to put the numeric data in a separate data frame:

Ages <- writers_df[,1:2]

Only then can you start to get, for example, the mean and the median of your numeric data. You can do this with the apply() function. The first argument of this function should be your smaller data frame, in this case Ages. The second argument designates whether the calculations should be done over the rows (1) or over the columns (2). In this case, we want to calculate the median and mean of the variables Age.At.Death and Age.As.Writer, which designate columns in the data frame. The last argument then specifies the exact calculations that you want to do on your data:

apply(Ages, 2, median)
##  Age.At.Death Age.As.Writer 
##          40.5          27.0
apply(Ages,1,median)
## [1] 19.0 29.0 54.0 38.5

or

apply(Ages, 2, mean)
##  Age.At.Death Age.As.Writer 
##         43.75         26.50

Do you want to know more about the apply() function and how to use it? Check out our Intermediate R course, which teaches you, amongst other things, how to make your R code more efficient and readable using this function.

Surpassing The Data Frame Basics: More Questions, More Answers

Now that you have been introduced to the basic pitfalls of data frames, it’s time to look at some problems, questions or difficulties that you might have while working with data frames more intensively.

How To Create An Empty Data Frame

The easiest way to create an empty data frame is probably to assign the result of a data.frame() call without any arguments to a new object:

ab <- data.frame()
ab
## data frame with 0 columns and 0 rows

You can then start filling your data frame up by using the [,] notation. Be careful, however, because it’s easy to make errors while doing this!
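One relatively safe pattern (just a sketch, with made-up columns) is to start from empty but typed columns and to append complete rows with rbind(), so that the column names and types never get out of sync:

grown_df <- data.frame(Age=numeric(), Name=character())       #empty, but with named, typed columns
grown_df <- rbind(grown_df, data.frame(Age=22, Name="John"))  #append one complete row at a time
grown_df <- rbind(grown_df, data.frame(Age=40, Name="Edgar"))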

Note how you don’t see any column names in this empty data set. If you do want to have those, you can just initialize empty vectors in your data frame, like this:

Age <- numeric()
Name <- character()
ID <- integer()
Gender <- factor()
Date <- as.Date(character())
ab <- data.frame(Age, Name, ID, Gender, Date)
ab
## [1] Age    Name   ID     Gender Date  
## <0 rows> (or 0-length row.names)

How To Extract Rows And Columns: Subsetting Your Data Frame

Subsetting or extracting specific rows and columns from a data frame is an important skill in order to surpass the basics that have been introduced in step two, because it allows you to easily manipulate smaller sets of your original data frame. You basically extract those values from the rows and columns that you need in order to optimize the data analyses you make.

It’s easy to start subsetting with the [,] notation that was described in step two:

writer_names_df <- writers_df[1:4, 3:4]
writer_names_df
##    Name Surname
## 1  Jane     Doe
## 2 Edgar     Poe
## 3  Walt Whitman
## 4  Jane  Austen

Note that you can also define this subset with the variable names:

writer_names_df <- writers_df[1:4, c("Name", "Surname")]

Tip: be careful when you are subsetting just one column!

R has the tendency to simplify your results, which means that a subset of a single column will be returned as a vector rather than as a data frame, which is normally not what you want. To make sure that this doesn’t happen, you can add the argument drop=FALSE:

writer_names_df <- writers_df[1:4, "Name", drop=FALSE]
str(writer_names_df)
## 'data.frame':    4 obs. of  1 variable:
##  $ Name:Class 'AsIs'  chr [1:4] "Jane" "Edgar" "Walt" "Jane"

In a next step, you can try subsetting with the subset() function:

writer_names_df <- subset(writers_df, Age.At.Death <= 40 & Age.As.Writer >= 18)
writer_names_df
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 2           40            18 Edgar     Poe   MALE 1849-10-07

You can also subset on a particular value:

writer_names_df <- subset(writers_df, Name =="Jane")
writer_names_df
##   Age.At.Death Age.As.Writer Name Surname Gender      Death
## 1           22            16 Jane     Doe FEMALE 2015-05-10
## 4           41            36 Jane  Austen FEMALE 1817-07-18

You can not only subset with the R functions that have been described above. You can also turn to grep() to get the job done. For example, if you want to work with the rows in the column Age.At.Death that have values that contain “4”, you can use the following line of code:

fourty_writers <- writers_df[grep("4", writers_df$Age.At.Death),]
fourty_writers
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 4           41            36  Jane  Austen FEMALE 1817-07-18

Note that by subsetting, you basically stop considering certain values of your data frame. This might mean that some levels of a factor are no longer used, for example when you only consider the MALE members of the writers_df data frame. Notice how all factor levels of this column still remain present, even though you have created a subset:

male_writers <- writers_df[Gender =="MALE",]
str(male_writers)
## 'data.frame':    0 obs. of  6 variables:
##  $ Age.At.Death : num 
##  $ Age.As.Writer: num 
##  $ Name         :Class 'AsIs'  chr(0) 
##  $ Surname      :Class 'AsIs'  chr(0) 
##  $ Gender       : Factor w/ 2 levels "FEMALE","MALE": 
##  $ Death        :Class 'Date'  num(0)

To remove the factor levels that are no longer present, you can enter the following line of code:

factor(Gender)
## factor(0)
## Levels:
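Alternatively, base R has the droplevels() function, which drops the unused levels from a single factor or from every factor column of a data frame at once:

male_writers$Gender <- droplevels(male_writers$Gender)  #drop unused levels from one factor column
male_writers <- droplevels(male_writers)                #or from all factor columns of the data frame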

How To Remove Columns And Rows From A Data Frame

If you want to remove an entire column from your data frame, you can assign NULL to that column:

writers_df$Age.At.Death <- NULL

Note that assigning NULL to a single cell, as in writers_df[1,3] <- NULL, is not allowed; if you want to blank out a single value, assign NA instead:

writers_df[1,3] <- NA

To remove rows, the procedure is a bit more complicated. You define a logical vector in which you indicate, for every row, whether it should be kept or not. Then, you apply this vector to your data frame:

rows_to_keep <- c(TRUE, FALSE, TRUE, FALSE)
limited_writers_df <- writers_df[rows_to_keep,]
limited_writers_df
##   Age.At.Death Age.As.Writer Name Surname Gender      Death
## 1           22            16 Jane     Doe FEMALE 2015-05-10
## 3           72            36 Walt Whitman   MALE 1892-03-26

Note that you can also do the opposite by just adding !, stating that the reverse is true:

less_writers_df <- writers_df[!rows_to_keep,]
less_writers_df
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 4           41            36  Jane  Austen FEMALE 1817-07-18

You can also work with thresholds. For example, you can specify that you only want to keep all writers that were older than forty when they died:

fourty_sth_writers <- writers_df[writers_df$Age.At.Death > 40,]
fourty_sth_writers
##   Age.At.Death Age.As.Writer Name Surname Gender      Death
## 3           72            36 Walt Whitman   MALE 1892-03-26
## 4           41            36 Jane  Austen FEMALE 1817-07-18

How To Add Rows And Columns To A Data Frame

Much in the same way that you used the [,] and $ notations to access and change single values of your data frame, you can also easily add columns to your data frame:

writers_df$Location <- c("Belgium", "United Kingdom", "United States", "United Kingdom")

Appending rows to an existing data frame is somewhat more complicated. The easiest way to do this is to first make a new row as a vector, respecting the column variables that have been defined in writers_df, and to then bind this row to the original data frame with the rbind() function:

new_row <- c(50, 22, "Roberto", "Bolano", "MALE", "2003-07-15")
writers_df_large <- rbind(writers_df, new_row)
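Since c() can only hold values of one type, the new row above is a character vector, which can end up coercing the column types when it is bound. A slightly more careful sketch (with the same made-up values, plus a Location value to match the column added above) is to bind a one-row data frame instead, so that every column keeps its own type:

new_row_df <- data.frame(Age.At.Death=50, Age.As.Writer=22,
                         Name="Roberto", Surname="Bolano",
                         Gender="MALE", Death=as.Date("2003-07-15"),
                         Location="Chile")
writers_df_large <- rbind(writers_df, new_row_df)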

Why And How To Reshape A Data Frame From Wide To Long Format And Vice Versa

When you have multiple values, spread out over multiple columns, for the same instance, your data is in the “wide” format. Your data is in the “long” format, on the other hand, when there is one observation row per variable, so that you have multiple rows per instance. Let’s illustrate this with an example. Long data looks like this:

Subject <- c(1,2,1,2,2,1)
Gender <- c("M", "F", "M", "F", "F","M")
Test <- c("Read", "Write", "Write", "Listen", "Read", "Listen")
Result <- c(10, 4, 8, 6, 7, 7)
observations_long <- data.frame(Subject, Gender, Test, Result)
observations_long
##   Subject Gender   Test Result
## 1       1      M   Read     10
## 2       2      F  Write      4
## 3       1      M  Write      8
## 4       2      F Listen      6
## 5       2      F   Read      7
## 6       1      M Listen      7

As you can see, there is one row for each value that you have in the Test variable. A lot of statistical tests favor this format.

This data frame would look like the following in the wide format:

Subject <- c(1,2)
Gender <- c("M", "F")
Read <- c(10, 7)
Write <-c(8, 4)
Listen <- c(7, 6)
observations_wide <- data.frame(Subject, Gender, Read, Write, Listen)
observations_wide
##   Subject Gender Read Write Listen
## 1       1      M   10     8      7
## 2       2      F    7     4      6

You see that each column represents a unique pairing of the various factors with the values.

Since different functions may require you to input your data either in “long” or “wide” format, you might need to reshape your data set. There are two main options that you can choose here: you can use the stack() function or you can try using the reshape() function. The former is preferred when you work with simple data frames, while the latter is more often used on more complex data frames, mostly because there’s a difference in the possibilities that both functions offer.

Make sure to keep on reading to know more about the differences in possibilities between the stack() and reshape() functions!

Using stack() For Simply Structured Data Frames

The stack() function basically concatenates or combines multiple vectors into a single vector, along with a factor that indicates where each observation originates from.

To go from wide to long format, you will have to stack your observations, since you want one observation row per variable, with multiple rows per instance. In this case, you want to merge the columns Read, Write and Listen together, both their names and their values:

long_format <- stack(observations_wide, 
                     select=c(Read, 
                              Write, 
                              Listen))
long_format
##   values    ind
## 1     10   Read
## 2      7   Read
## 3      8  Write
## 4      4  Write
## 5      7 Listen
## 6      6 Listen

To go from long to wide format, you will need to unstack your data, which makes sense because you want to have one row per instance with each value present as a different variable. Note here that you want to disentangle the Result and Test columns:

wide_format <- unstack(observations_long, 
                       Result ~ Test)
wide_format
##   Listen Read Write
## 1      6   10     4
## 2      7    7     8

Using reshape() For Complex Data Frames

This function is part of the stats package. This function is similar to the stack() function, but is a little bit more elaborate. Read and see for yourself how reshaping your data works with the reshape() function:

To go from a wide to a long data format, you can first start off by entering the reshape() function. The first argument should always be your original wide data set. In this case, you can specify that you want to input the observations_wide to be converted to a long data format.

Then, you start adding other arguments to the reshape() function:

  1. Include a list of variable names that define the different measurements through varying. In this case, you store the scores of specific tests in the columns “Read”, “Write” and “Listen”.
  2. Next, add the argument v.names to specify the name that you want to give to the variable that contains these values in your long dataset. In this case, you want to combine all scores for all reading, writing and listening tests into one variable Score.
  3. You also need to give a name to the variable that describes the different measurements that are inputted with the argument timevar. In this case, you want to give a name to the column that contains the types of tests that you give to your students. That’s why this column’s name should be called “Test”.
  4. Then, you add the argument times, because you need to specify that the new column “Test” can only take three values, namely, the test components that you have stored: “Read”, “Write”, “Listen”.
  5. You’re finally there! Give in the end format for the data with the argument direction.
  6. Additionally, you can specify new row names with the argument new.row.names.

Tip: try leaving out this last argument and see what happens!

library(stats)
long_reshape <- reshape(observations_wide, 
             varying = c("Read", "Write", "Listen"), 
             v.names = "Score",
             timevar = "Test", 
             times = c("Read", "Write", "Listen"),
             direction = "long",
             new.row.names = 1:1000)
long_reshape
##   Subject Gender   Test Score id
## 1       1      M   Read    10  1
## 2       2      F   Read     7  2
## 3       1      M  Write     8  1
## 4       2      F  Write     4  2
## 5       1      M Listen     7  1
## 6       2      F Listen     6  2

From long to wide, you take sort of the same steps. First, you take the reshape() function and give it its first argument, which is the data set that you want to reshape. The other arguments are as follows:

  1. timevar allows you to specify that the variable Test, which describes the different tests that you give to your students, should be decomposed.
  2. You also specify that the reshape() function shouldn’t take into account the variables Subject and Gender of the original data set. You put these column names into idvar.
  3. By not naming the variable Result, the reshape() function will know that both Test and Result should be recombined.
  4. You specify the direction of the reshaping, which is in this case, wide!
wide_reshape <- reshape(observations_long, 
                        timevar = "Test",
                        idvar = c("Subject", "Gender"),
                        direction = "wide")
wide_reshape
##   Subject Gender Result.Read Result.Write Result.Listen
## 1       1      M          10            8             7
## 2       2      F           7            4             6

Note that if you want you can also rename or sort the results of these new long and wide data formats! You can find detailed instructions below.

Reshaping Data Frames With tidyr

This package allows you to “easily tidy data with the spread() and gather() functions” and that’s exactly what you’re going to do if you use this package to reshape your data!

If you want to convert from wide to long format, the principle stays similar to the one that of reshape(): you use the gather() function and you start specifying its arguments:
1. Your data set is the first argument to the gather() function
2. Then, you specify the name of the column in which you will combine the values of Read, Write and Listen. In this case, you want to call it something like Test or Test.Type.
3. You enter the name of the column in which all the values of the Read, Write and Listen columns are listed.
4. You indicate which columns are supposed to be combined into one. In this case, that will be the columns from Read, to Listen.

library(tidyr)
long_tidyr <- gather(observations_wide, 
                     Test, 
                     Result, 
                     Read:Listen)
long_tidyr
##     Subject Gender   Test Result
## 1       1      M   Read     10
## 2       2      F   Read      7
## 3       1      M  Write      8
## 4       2      F  Write      4
## 5       1      M Listen      7
## 6       2      F Listen      6

Note how the last argument specifies the columns in the same way as you did when subsetting your data frame or when selecting the columns on which you wanted to perform mathematical operations. You can also just specify the columns individually, like this:

long_tidyr <- gather(observations_wide, 
                     Test, 
                     Result, 
                     Read, 
                     Write, 
                     Listen)

The opposite direction, from long to wide format, is very similar to the function above, but this time with the spread() function:

library(tidyr)
wide_tidyr <- spread(observations_long, 
                     Test, 
                     Result)
wide_tidyr
##    Subject Gender Listen Read Write
## 1       1      M      7   10     8
## 2       2      F      6    7     4

Again, you take as the first argument your data set. Then, you specify the column that contains the new column names. In this case, that is Test. Lastly, you input the name of the column that contains the values that should be put into the new columns.

Tip: take a look at the “Data Wrangling With dplyr And tidyr Cheat Sheet” for a complete overview of the possibilities that these packages can offer you to wrangle your data!

Reshaping Data Frames With reshape2

This package, which allows you to “flexibly reshape data”, actually has very straightforward ways of reshaping your data frame.

To go from a wide to a long data format, you use the melt() function. This function is pretty easy, since it just takes your data set and the id.vars argument, which you may already know from the reshape() function. This argument allows you to specify which columns should be left alone by the function.

library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
long_reshaped2 <- melt(observations_wide, 
                       id.vars=c("Subject", "Gender"))
long_reshaped2
##   Subject Gender variable value
## 1       1      M     Read    10
## 2       2      F     Read     7
## 3       1      M    Write     8
## 4       2      F    Write     4
## 5       1      M   Listen     7
## 6       2      F   Listen     6

Note that this function allows you to specify a couple more arguments:

library(reshape2)
long_reshaped2 <- melt(observations_wide, 
                       id.vars=c("Subject", "Gender"),
                       measure.vars=c("Read", "Write", "Listen"),
                       variable.name="Test",
                       value.name="Result")
long_reshaped2
##   Subject Gender   Test Result
## 1       1      M   Read     10
## 2       2      F   Read      7
## 3       1      M  Write      8
## 4       2      F  Write      4
## 5       1      M Listen      7
## 6       2      F Listen      6
  • measure.vars lists the columns that should be stacked into the long format, in this case Read, Write and Listen. If you leave out this argument, the melt() function will treat all variables that are not in id.vars as measured variables.
  • variable.name specifies how you want to name that destination column. If you don’t specify this argument, you will have a column named “variable” in your result.
  • value.name allows you to input the name of the column in which the values or test results will be stored. If you leave out this argument, this column will be named “value”.

You can also go from a long to a wide format with the reshape2 package by using the dcast() function. This is fairly easy: you first pass in your data set, as always. Then, in the formula, you list the columns that you don’t want to be touched; in this case, you want to keep Subject and Gender as they are. The column Test, however, is the one you want to split into separate columns, so it goes after the ~. The last argument of this function is value.var, which names the column that holds the values of the different tests; in this case, that is Result:

library(reshape2)
wide_reshaped2 <- dcast(observations_long, 
                        Subject + Gender ~ Test, 
                        value.var="Result")
wide_reshaped2
##   Subject Gender Listen Read Write
## 1       1      M      7   10     8
## 2       2      F      6    7     4

How To Sort A Data Frame

Sorting a data frame by columns might seem tricky, but this can be made easy by either using R’s built-in order() function or by using a package.

R’s Built-In Order() Function

You can, for example, sort by one of the data frame’s columns. Here you order the rows of the data frame according to the values that are stored in the variable Age.As.Writer:

writers_df[order(writers_df$Age.As.Writer),]
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 1           22            16  Jane     Doe FEMALE 2015-05-10
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 3           72            36  Walt Whitman   MALE 1892-03-26
## 4           41            36  Jane  Austen FEMALE 1817-07-18

If you want to sort the values from high to low, you can just add the extra argument decreasing and set it to TRUE, since it only takes logical values.

Remember that logical values are TRUE or FALSE.

Another way is to wrap the order() function inside rev(). As the function’s name suggests, rev() gives you the reversed version of its argument, which is order(writers_df$Age.As.Writer) in this case:

writers_df[order(writers_df$Age.As.Writer, decreasing=TRUE),]
writers_df[rev(order(writers_df$Age.As.Writer)),]

You can also add a - in front of the numeric variable that you want to order on:

writers_df[order(-writers_df$Age.As.Writer),]
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 3           72            36  Walt Whitman   MALE 1892-03-26
## 4           41            36  Jane  Austen FEMALE 1817-07-18
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 1           22            16  Jane     Doe FEMALE 2015-05-10

Sorting With dplyr

The dplyr package, known for its abilities to manipulate data, has a specific function that allows you to sort rows by variables.

dplyr’s function to make this happen is arrange(). The first argument of this function is the data set that you want to sort, while the following arguments are the variables to sort by. In this case you sort first on the variable Age.At.Death and then on Age.As.Writer:

library(dplyr)
data2 <- arrange(writers_df, Age.At.Death, Age.As.Writer)
data2
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 1           22            16  Jane     Doe FEMALE 2015-05-10
## 2           40            18 Edgar     Poe   MALE 1849-10-07
## 3           41            36  Jane  Austen FEMALE 1817-07-18
## 4           72            36  Walt Whitman   MALE 1892-03-26

You can also use the following approach to get the same result:

writers_df[with(writers_df, order(Age.At.Death, Age.As.Writer)), ]

If you want to sort these columns in descending order, you can add the function desc() to the variables:

desc_sorted_data <- arrange(writers_df, desc(Age.At.Death))

Interested in doing much more with the dplyr package? Check out our Data Manipulation in R with dplyr course, which will teach you how to perform sophisticated data manipulation tasks using dplyr! Also, don’t forget to look at the “Data Wrangling With dplyr And tidyr Cheat Sheet”!

How To Merge Data Frames

Merging Data Frames On Column Names

You can use the merge() function to join two, but only two, data frames. Let’s say you have a data frame data2 that also contains a variable Age.At.Death, with exactly the same values as in writers_df. You thus want to merge the two data frames on the basis of this variable:

data2 <- data.frame(Age.At.Death=c(22,40,72,41), Location=5:8)

We can easily merge these two:

new_writers_df <- merge(writers_df, data2)
new_writers_df
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death Location
## 1           22            16  Jane     Doe FEMALE 2015-05-10        5
## 2           40            18 Edgar     Poe   MALE 1849-10-07        6
## 3           41            36  Jane  Austen FEMALE 1817-07-18        8
## 4           72            36  Walt Whitman   MALE 1892-03-26        7

Tip: check what happens if you change the order of the two arguments of the merge() function!

This way of merging, which only keeps the rows for which the merging variable matches in both data frames, is equivalent to an inner join in SQL.

Unfortunately, you’re not always this lucky with your data frames. In many cases, some of the column names or variable values will differ, which makes it hard to follow the easy, standard procedure that was described just now. In addition, you may not always want to merge in the standard way that was described above. In the following, some of the most common issues are listed and solved!

What If… (Some Of) The Data Frame’s Column Values Are Different?

If (some of) the values of the variable on which you merge differ between the data frames, you have a small problem, because the merge() function by default only keeps the rows for which the merging variable matches, so any new variables that are present in the second data frame can only be added to those matching rows of the first data frame. Consider the following data frame:

data2 <- data.frame(Age.At.Death=c(21,39,71,40), Location=5:8)

You see that the values for the attribute Age.At.Death do not fit with the ones that were defined for the writers_df data frame.

No worries, the merge() function provides extra arguments to solve this problem. Setting the argument all.x to TRUE tells merge() to keep every row of the first data frame, writers_df, in the result. The Location variable of data2 is added as a new column: its values are filled in for the rows whose Age.At.Death occurs in both data frames, while all rows where the Age.At.Death values don’t correspond are filled up with NA, as the sketch below shows.
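
A minimal sketch with the data defined above (only the row with Age.At.Death equal to 40 has a match in data2, so the other rows get NA for Location):

# Keep every row of writers_df; rows without a match get NA for Location
merge(writers_df, data2, all.x=TRUE)
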
Note that this join corresponds to a left outer join in SQL and that the default value of the all.x argument is FALSE, which means that by default only the rows with corresponding values of the merging variable are taken into account. Compare with:

merge(writers_df, data2, all.x=FALSE)

You can also specify the argument all.y=TRUE if you want to keep every row of data2, including those that have no matching row in writers_df:

merge(writers_df, data2, all.y=TRUE)

Note that this type of join corresponds to a right outer join in SQL.

What If… Both Data Frames Have The Same Column Names?

What if your two data frames have exactly the same two variables, with or without the same values?

data2 <- data.frame(Age.At.Death=c(21,39,71,40), Age.As.Writer=c(11,25,36,28))

You can choose to keep all values from all corresponding variables and to add rows to the resulting data frame:

merge(writers_df, data2, all=TRUE)
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 1           21            11  <NA>    <NA>   <NA>       <NA>
## 2           22            16  Jane     Doe FEMALE 2015-05-10
## 3           39            25  <NA>    <NA>   <NA>       <NA>
## 4           40            18 Edgar     Poe   MALE 1849-10-07
## 5           40            28  <NA>    <NA>   <NA>       <NA>
## 6           41            36  Jane  Austen FEMALE 1817-07-18
## 7           71            36  <NA>    <NA>   <NA>       <NA>
## 8           72            36  Walt Whitman   MALE 1892-03-26

Or you can choose to merge on one specific variable only, keeping just the rows for which the ages at death correspond:

merge(writers_df, data2, by="Age.At.Death")
##   Age.At.Death Age.As.Writer.x  Name Surname Gender      Death
## 1           40              18 Edgar     Poe   MALE 1849-10-07
##   Age.As.Writer.y
## 1              28

What If… The Data Frames’ Column Names Are Different?

Lastly, what if the names of the variables on which you merge differ between the two data frames?

data2 <- data.frame(Age=c(22,40,72,41), Location=5:8)

You just tell the merge() function which column to use from each data frame through the arguments by.x and by.y.

merge(writers_df, data2, by.x="Age.At.Death", by.y="Age")
#   Age.At.Death Age.As.Writer  Name Surname Gender      Death Location
## 1           22            16  Jane     Doe FEMALE 2015-05-10        5
## 2           40            18 Edgar     Poe   MALE 1849-10-07        6
## 3           41            36  Jane  Austen FEMALE 1817-07-18        8
## 4           72            36  Walt Whitman   MALE 1892-03-26        7

Merging Data Frames On Row Names

You can also merge two data frames that contain distinct sets of columns but share (some of) their row names. The merge() function and its arguments come to the rescue!

Consider this second data frame:

Address <- c("50 West 10th", "77 St. Marks Place", "778 Park Avenue")
Maried <- c("YES", "NO", "YES")
limited_writers_df <- data.frame(Address, Maried)
limited_writers_df
##              Address Maried
## 1       50 West 10th    YES
## 2 77 St. Marks Place     NO
## 3    778 Park Avenue    YES

You see that this data set contains three rows, marked with numbers 1 to 3, and two additional columns that are not in the writers_df data frame. To merge these two data frames, you add the argument by to the merge() function and set it to 0, which tells merge() to match on the row names. Since you choose to keep all values from all corresponding variables and to add columns to the resulting data frame, you set the all argument to TRUE:

writers_row_sorted <- merge(writers_df, limited_writers_df, by=0, all=TRUE)
writers_row_sorted
##   Row.names Age.At.Death Age.As.Writer  Name Surname Gender      Death
## 1         1           22            16  Jane     Doe FEMALE 2015-05-10
## 2         2           40            18 Edgar     Poe   MALE 1849-10-07
## 3         3           72            36  Walt Whitman   MALE 1892-03-26
## 4         4           41            36  Jane  Austen FEMALE 1817-07-18
##              Address Maried
## 1       50 West 10th    YES
## 2 77 St. Marks Place     NO
## 3    778 Park Avenue    YES
## 4               <NA>   <NA>

It could be that the fields of rows that don’t occur in both data frames are filled with NA-values. You can easily solve this by removing them, as discussed below.

How To Remove Data Frames’ Rows And Columns With NA-Values

To remove all rows that contain NA-values, one of the easiest options is to use the na.omit() function, which takes your data frame as an argument. Let’s recycle the code from the previous section in which two data frames were merged, with a lot of resulting NA-values:

data2 <- data.frame(Age.At.Death=c(21,39,71,40), Location=5:8)
merge <- merge(writers_df, data2, all.y=TRUE)
na.omit(merge)
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death Location
## 3           40            18 Edgar     Poe   MALE 1849-10-07        8

If you just want to remove the NA-values based on part of your data frame, it’s better to use complete.cases(). In this case, you’re interested in keeping all rows for which the values of the columns Age.As.Writer and Name are complete:

data2 <- data.frame(Age.At.Death=c(21,39,71,40), Location=5:8)
merge <- merge(writers_df, data2, all.y=TRUE)
merge[complete.cases(merge[,2:3]),]
##   Age.At.Death Age.As.Writer  Name Surname Gender      Death Location
## 3           40            18 Edgar     Poe   MALE 1849-10-07        8

How To Convert Lists Or Matrices To Data Frames And Back

From Lists or Matrices To Data Frames

Lists or matrices that comply with the restrictions that the data frame imposes can be coerced into data frames with the as.data.frame() function. Remember that a data frame is similar to the structure of a matrix, where the columns can be of different types. Data frames are also similar to lists, where each column is an element of the list and each element has the same length. Any matrices or lists that you want to convert to data frames need to satisfy these restrictions.

For example, the matrix A can be converted to a data frame because each column contains values of the numeric data type:

A = matrix(c(2, 4, 3, 1, 5, 7), nrow=2, ncol=3, byrow = TRUE) 
A
##      [,1] [,2] [,3]
## [1,]    2    4    3
## [2,]    1    5    7

You enter the matrix A as an argument to the as.data.frame() function:

A_df <- as.data.frame(A)
A_df
##   V1 V2 V3
## 1  2  4  3
## 2  1  5  7

You can follow the same procedures for lists like the one that is shown below:

n = c(2, 3, 5) 
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
x = list(n, s, b, 3)
x_df <- as.data.frame(x)

Changing A Data Frame To A Matrix Or List

To make the opposite move, that is, to convert data frames to matrices and lists, you first have to check for yourself if this is possible. How many data types does your data frame contain, and can the target structure hold them? Rewatch the small animation of the introduction if you’re not sure which data structure to pick.

Once you have an answer, you can use the functions as.matrix() and as.list() to convert your data frame to a matrix or a list, respectively:

writers_matrix <- as.matrix(writers_df)
writers_matrix
##      Age.At.Death Age.As.Writer Name    Surname   Gender   Death       
## [1,] "22"         "16"          "Jane"  "Doe"     "FEMALE" "2015-05-10"
## [2,] "40"         "18"          "Edgar" "Poe"     "MALE"   "1849-10-07"
## [3,] "72"         "36"          "Walt"  "Whitman" "MALE"   "1892-03-26"
## [4,] "41"         "36"          "Jane"  "Austen"  "FEMALE" "1817-07-18"
writers_list <- as.list(writers_df)
writers_list
## $Age.At.Death
## [1] 22 40 72 41
## 
## $Age.As.Writer
## [1] 16 18 36 36
## 
## $Name
## [1] "Jane"  "Edgar" "Walt"  "Jane" 
## 
## $Surname
## [1] "Doe"     "Poe"     "Whitman" "Austen" 
## 
## $Gender
## [1] FEMALE MALE   MALE   FEMALE
## Levels: FEMALE MALE
## 
## $Death
## [1] "2015-05-10" "1849-10-07" "1892-03-26" "1817-07-18"

For those of you who want to specifically make numeric matrices, you can use the function data.matrix() or add an sapply() function to the as.matrix() function:

writers_matrix <- data.matrix(writers_df)
writers_matrix <- as.matrix(sapply(writers_df, as.numeric))

Note that with the current writers_df data frame, which contains a mixture of data types, NA-values will be introduced in the resulting matrices.

From Data Frames To Data Analysis, Data Manipulation and Data Visualization

Data frames are just the beginning of your data analysis! There is much more to see and know about data frames and the other R data structures. If this tutorial has gotten you thrilled to dig deeper into programming with R, make sure to check out our free interactive Introduction to R course. Those of you who are already more advanced with R and who want to take their skills to a higher level might be interested in our courses on data manipulation and data visualization. Go to our course overview and take a look!


The post 15 Easy Solutions To Your Data Frame Problems In R appeared first on The DataCamp Blog.

To leave a comment for the author, please follow the link and comment on his blog: The DataCamp Blog » R.


Shiny Wool Skeins


(This article was first published on Ripples, and kindly contributed to R-bloggers)

Chaos is not a pit: chaos is a ladder (Littlefinger in Game of Thrones)

Some time ago I wrote this post to show how my colleague Vu Anh translated into Shiny one of my experiments, opening my eyes to an amazing new world. I am very proud to present you the first Shiny experiment entirely written by me.

In this case I took inspiration from another previous experiment to draw some kind of wool skeins. The Shiny app creates a plot consisting of chords inside a circle. There are two kinds of chords:

  • Those which form a track because they are a set of glued chords; the number of tracks and the number of chords per track can be selected using the Number of track chords and Number of scrawls per track sliders of the app, respectively.
  • Those forming the background, randomly allocated inside the circle. The number of background chords can be chosen in the app as well.

There is also the possibility to change the colors of the chords. These are the main steps I followed to build this Shiny app:

  1. Write a simple R program
  2. Decide which variables to parametrize
  3. Open a new Shiny project in RStudio
  4. Analyze the sample UI.R and server.R files generated by default
  5. Adapt sample code to my particular code (some iterations are needed here)
  6. Deploy my app in the Shiny Apps free server

Number 1 is the most difficult step, but it does not depend on Shiny: the rest of them are easier, especially if you have help, as I had from my colleague Jorge. I encourage you to try. This is a snapshot of the app:

Skeins2

You can play with the app here.

Some things I thought while developing this experiment:

  • Shiny gives you a lot with a minimal effort
  • Shiny can be a very interesting tool to teach maths and programming to kids
  • I have to translate to Shiny some other experiment
  • I will try to use it for my job

Try Shiny: it is very entertaining. A typical Shiny project consists of two files, one to define the user interface (UI.R) and the other to define the back-end side (server.R).

This is the code of UI.R:

# This is the user-interface definition of a Shiny web application.
# You can find out more about building applications with Shiny here:
#
# http://shiny.rstudio.com
#

library(shiny)

shinyUI(fluidPage(

  # Application title
  titlePanel("Shiny Wool Skeins"),
  HTML("<p>This experiment is based on <a href="https://aschinchon.wordpress.com/2015/05/13/bertrand-or-the-importance-of-defining-problems-properly/">this previous one</a> I did some time ago. It is my second approach to the wonderful world of Shiny.</p>"),
  # Sidebar with a slider input for number of bins
  sidebarLayout(
    sidebarPanel(
      inputPanel(
        sliderInput("lin", label = "Number of track chords:",
                    min = 1, max = 20, value = 5, step = 1),
        sliderInput("rep", label = "Number of scrawls per track:",
                    min = 1, max = 50, value = 10, step = 1),
        sliderInput("nbc", label = "Number of background chords:",
                    min = 0, max = 2000, value = 500, step = 2),
        selectInput("col1", label = "Track colour:",
                    choices = colors(), selected = "darkmagenta"),
        selectInput("col2", label = "Background chords colour:",
                    choices = colors(), selected = "gold")
      )
      
    ),

    # Show a plot of the generated distribution
    mainPanel(
      plotOutput("chordplot")
    )
  )
))

And this is the code of server.R:

# This is the server logic for a Shiny web application.
# You can find out more about building applications with Shiny here:
#
# http://shiny.rstudio.com
#
library(ggplot2)
library(magrittr)
library(grDevices)
library(shiny)

shinyServer(function(input, output) {

  df<-reactive({
    ini=runif(n=input$lin, min=0,max=2*pi)
    ini %>% 
      +runif(n=input$lin, min=pi/2,max=3*pi/2) %>% 
      cbind(ini, end=.) %>% 
      as.data.frame() -> Sub1
    Sub1=Sub1[rep(seq_len(nrow(Sub1)), input$rep),]
    Sub1 %>% apply(c(1, 2), jitter) %>% as.data.frame() -> Sub1
    Sub1=with(Sub1, data.frame(col=input$col1, x1=cos(ini), y1=sin(ini), x2=cos(end), y2=sin(end)))
    Sub2=runif(input$nbc, min = 0, max = 2*pi)
    Sub2=data.frame(x=cos(Sub2), y=sin(Sub2))
    Sub2=cbind(input$col2, Sub2[(1:(input$nbc/2)),], Sub2[(((input$nbc/2)+1):input$nbc),])
    colnames(Sub2)=c("col", "x1", "y1", "x2", "y2")
    rbind(Sub1, Sub2)
  })
  
  opts=theme(legend.position="none",
             panel.background = element_rect(fill="white"),
             panel.grid = element_blank(),
             axis.ticks=element_blank(),
             axis.title=element_blank(),
             axis.text =element_blank())
  
  output$chordplot<-renderPlot({
    p=ggplot(df())+geom_segment(aes(x=x1, y=y1, xend=x2, yend=y2), colour=df()$col, alpha=runif(nrow(df()), min=.1, max=.3), lwd=1)+opts;print(p)
  }, height = 600, width = 600 )
  

})

To leave a comment for the author, please follow the link and comment on his blog: Ripples.


Fishing for packages in CRAN


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

It is incredibly challenging to keep up to date with R packages. As of today (6/16/15), there are 6,789 listed on CRAN. Of course, the CRAN Task Views are probably the best resource for finding what's out there. A tremendous amount of work goes into maintaining and curating these pages and we should all be grateful for the expertise, dedication and efforts of the task view maintainers. But, R continues to grow at a tremendous rate. (Have a look at the growth curve in Bob Muenchen's 5/22/15 post R Now Contains 150 Times as Many Commands as SAS). CRANberries, a site that tracks new packages and package updates, indicates that over the last few months the list of R packages has been growing by about 100 packages per month. How can anybody hope to keep current?

So, on any given day, expect that finding out what R packages exist that may pertain to any particular topic will require some work. What follows is a beginner's guide to fishing for packages in CRAN. This example looks for "Bayesian" packages using some simple web page scraping and elementary text mining.

The Bayesian Inference Task View lists 144 packages. This is probably everything that is really important, but let's see what else is to be found that has anything at all to do with Bayesian Inference. In the first block of code, R's available.packages() function fetches the list of packages available from my Windows PC. (This is an extremely interesting function and I don't do justice to it here.) Then, this list is used to scrape the package descriptions from the various package webpages. The loop takes some time to run so I saved the package descriptions both in a csv file and in a .RData workspace.

library(svTools)
library(RCurl)
library(tm)
#-----------------------------------------
# TWO HELPER FUNCTIONS
# Function to get package description from CRAN package page
getDesc <- function(package){
  l1 <- regexpr("</h2>",package)
  ind1 <- as.integer(l1[[1]]) + 9
  l2 <- regexpr("Version",package)
  ind2 <- as.integer(l2[[1]]) - (46 + nchar("package"))
  desc <- substring(package,ind1,ind2)
  return(desc)
}
 
# Function to get CRAN package page
getPackage <- function(name){
  url <- paste("http://cran.r-project.org/web/packages/",name,"/index.html",sep="")
  txt <- getURL(url,ssl.verifypeer=FALSE)
  return(txt)
}
#--------------------------------------------
# SCRAPE PACKAGE DATA FROM CRAN
# Get the list of R packages
packages <- as.data.frame(available.packages())
head(packages)
dim(packages)
 
pkgNames <- rownames(packages)
rm(packages)           # Dont need this any more
pkgDesc <- vector()
for (i in 1:length(pkgNames)){
 
  pkgDesc[i] <- getDesc(getPackage(pkgNames[i]))
}

length(pkgDesc) #6598
 
#----------------------------------------------
# SOME HOUSEKEEPING
# cranP <- data.frame(pkgNames,pkgDesc)
# write.csv(cranP,"C:/DATA/CRAN/CRAN_pkgs_6_15_15")
# save.image("pkgs.RData")
# load("pkgs.RData")

When I did this a few days ago 6,598 packages were available. The next section of code turns the vector of package descriptions into a document corpus and creates a document term matrix with a row for each package and 20,781 terms. Taking the transpose of the term matrix makes it easier to see what is going on. The matrix is extremely sparse (only a single 1 shows up in the small portion of the matrix shown below) and all of these terms are pretty much useless. Removing the sparse terms cuts the matrix down to only 372 terms.

# SOME SIMPLE TEXT MINING
# Make a corpus  out of package descriptions
pCorpus <- VCorpus(VectorSource(pkgDesc))
pCorpus
inspect(pCorpus[1:3])
 
# Function to prepare corpus
prepC <- function(corpus){
  c <- tm_map(corpus, stripWhitespace)
  c <- tm_map(c,content_transformer(tolower))
  c <- tm_map(c,removeWords,stopwords("english"))
  c <- tm_map(c,removePunctuation)
  c <- tm_map(c,removeNumbers)
  return(c)}
 
pCorpusPrep <- prepC(pCorpus)
 
#------------------------------------------------------------
# Create the document term matrix
dtm <- DocumentTermMatrix(pCorpusPrep)
dtm
# <<DocumentTermMatrix (documents: 6598, terms: 20781)>>
#   Non-/sparse entries: 142840/136970198
# Sparsity           : 100%
# Maximal term length: 83
# Weighting          : term frequency (tf)
 
 
# Work with the transpose to list keywords as rows
inspect(t(dtm[100:105,90:105]))
 
# Docs
# Terms          100 101 102 103 104 105
# accomodated    0   0   0   0   0   0
# accompanied    0   0   0   0   0   0
# accompanies    0   0   0   0   0   0
# accompany      0   0   0   0   0   0
# accompanying   0   0   0   0   0   0
# accomplished   0   0   0   0   0   0
# accomplishes   0   0   0   0   0   0
# accordance     0   0   0   0   0   0
# according      0   0   1   0   0   0
# accordingly    0   0   0   0   0   0
# accordinglyp   0   0   0   0   0   0
# account        0   0   0   0   0   0
# accounted      0   0   0   0   0   0
# accounting     0   0   0   0   0   0
# accountp       0   0   0   0   0   0
# accounts       0   0   0   0   0   0
 
 
# Reduce the number of sparse terms
dtms <- removeSparseTerms(dtm,0.99)
 
dim(dtms)  # 6598  372

I am pretty much counting on some luck here, hoping that "Bayesian" will be one of the remaining 372 terms. This last bit of code finds 229 packages associated with the keyword "Bayesian":

# Find the Bayesian packages
dtmsT <- t(dtms)
keywords <- row.names(dtmsT)                 
bi <- which(keywords == "bayesian")  # Find the index of an interesting keyword
 
bayes <- inspect(dtmsT)[bi,]         # Vexing that it prints to console
bayes_packages_index <- names(bayes[bayes==1])
 
# Here are the "Bayesian" packages
bayes_packages <- pkgNames[as.numeric(bayes_packages_index)]
length(bayes_packages) #229
 
# Here are the descriptions of the "Bayesian" packages
bayes_pkgs_desc <- pkgDesc[bayes==1]

Here is the list of packages found.

  BP

Not all of these "fish" are going to be worth keeping, but at least we have reduced the search to something manageable. In 10 or 15 minutes of fishing you might catch something interesting.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.


Organize a walk around London with R


(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)

The subtitle of this post can be “How to plot multiple elements on interactive web maps in R“.
In this experiment I will show how to include multiple elements in interactive maps created using both plotGoogleMaps and leafletR. To complete the work presented here you would need the following packages: sp, raster, plotGoogleMaps and leafletR.

I am going to use data from the OpenStreet maps, which can be downloaded for free from this website: weogeo.com
In particular I downloaded the shapefile with the stores, the one with the tourist attractions and the polyline shapefile with all the roads in London. I will assume that you want to spend a day or two walking around London, and for this you would need the locations of some hotels and the locations of all the Greggs in the area, for lunch. The goal is to create a web map with all these customized elements that you can take with you when you walk around the city; here is how to create it.

Once you have downloaded the shapefiles from weogeo.com you can open them and assign the correct projection with the following code:

stores <- shapefile("weogeo_j117529/data/shop_point.shp")
projection(stores)=CRS("+init=epsg:3857")
 
roads <- shapefile("weogeo_j117529/data/route_line.shp")
projection(roads)=CRS("+init=epsg:3857")
 
tourism <- shapefile("weogeo_j117529/data/tourism_point.shp")
projection(tourism)=CRS("+init=epsg:3857")

To extract only the data we need for the map we can use these lines:

Greggs <- stores[stores$NAME %in% c("Gregg's","greggs","Greggs"),]
 
Hotel <- tourism[tourism$TOURISM=="hotel",]
Hotel <- Hotel[sample(1:nrow(Hotel),10),]
 
 
Footpaths <- roads[roads$ROUTE=="foot",]

plotGoogleMaps
I created three objects: two are points (Greggs and Hotel) and the last is of class SpatialLinesDataFrame. We already saw how to plot Spatial objects with plotGoogleMaps; here the only difference is that we need to create several maps and then link them together.
Let’s take a look at the following code:

Greggs.google <- plotGoogleMaps(Greggs,iconMarker=rep("http://local-insiders.com/wp-content/themes/localinsiders/includes/img/tag_icon_food.png",nrow(Greggs)),mapTypeId="ROADMAP",add=T,flat=T,legend=F,layerName="Gregg's",fitBounds=F,zoom=13)
Hotel.google <- plotGoogleMaps(Hotel,iconMarker=rep("http://www.linguistics.ucsb.edu/projects/weal/images/hotel.png",nrow(Hotel)),mapTypeId="ROADMAP",add=T,flat=T,legend=F,layerName="Hotels",previousMap=Greggs.google)
 
plotGoogleMaps(Footpaths,col="dark green",mapTypeId="ROADMAP",filename="Multiple_Objects_GoogleMaps.html",legend=F,previousMap=Hotel.google,layerName="Footpaths",strokeWeight=2)

As you can see I first create two objects using the same function and then I call again the same function to draw and save the map. I can link the three maps together using the option add=T and previousMap.
We need to be careful here though, because the use of the option add is different from the standard plot function. With plot I call the function a first time and then, if I want to add a second element, I call it again with the option add=T. Here this option needs to go in the first and second calls, not in the last. Basically in this case we are telling R not to close the plot because later on we are going to add elements to it. In the last line we do not put add=T, thus telling R to go ahead and close the plot.

Another important option is previousMap, which is used starting from the second plot to link the various elements. This option always references the previous object, meaning that the map in Hotel.google references Greggs.google, while the last call references the previous map, Hotel.google, not the very first one.

The zoom level, if you want to set it, goes only in the first plot.

Another thing I changed compared to the last example is the addition of custom icons to the plot, using the option iconMarker. This takes a vector of icons, not just one, with the same length as the Spatial object to be plotted. That is why I use the function rep: to create a vector with the same URL repeated a number of times equal to the length of the object.
The icon can be whatever image you like. You can find a collection of free icons from this website: http://kml4earth.appspot.com/icons.html

The result is the map below, available here: Multiple_Objects_GoogleMaps.html

leafletR
We can do the same thing using leafletR. We first need to create GeoJSON files for each element of the map using the following lines:

Greggs.geojson <- toGeoJSON(Greggs)
Hotel.geojson <- toGeoJSON(Hotel)
Footpaths.geojson <- toGeoJSON(Footpaths)

Now we need to set the style for each element. For this task we are going to use the function styleSingle, which basically defines a single style for all the elements of the GeoJSON. This differs from the map in a previous post, in which we used the function styleGrad to create graduated colors depending on certain features of the dataset.
We can change the icons of the elements in leafletR using the following code:

Greggs.style <- styleSingle(marker=c("fast-food", "red", "s"))
Hotel.style <- styleSingle(marker=c("lodging", "blue", "s"))
Footpaths.style <- styleSingle(col="darkred",lwd=4)

As you can see we have the option marker that takes a vector with the name of the icon, its color and its size (between “s” for small, “m” for medium and “l” for large). The names of the icons can be found here: https://www.mapbox.com/maki/, where you have a series of icons and if you hover the mouse over them you would see some info, among which there is the name to use here, as the very last name. The style of the lines is set using the two options col and lwd, for line width.

Then we can simply use the function leaflet to set the various elements and styles of the map:

leaflet(c(Greggs.geojson,Hotel.geojson,Footpaths.geojson),style=list(Greggs.style,Hotel.style,Footpaths.style),popup=list(c("NAME"),c("NAME"),c("OPERATOR")),base.map="osm")

The result is the image below and the map available here: http://www.fabioveronesi.net/Blog/map.html


To leave a comment for the author, please follow the link and comment on his blog: R tutorial for Spatial Statistics.


Stop and Frisk: Blacks stopped 3-6 times more than Whites over 10 years


(This article was first published on Stable Markets » R, and kindly contributed to R-bloggers)

The NYPD provides publicly available data on stop and frisks with data dictionaries, located here. The data, ranging from 2003 to 2014, contains information on over 4.5 million stops. Several variables such as the age, sex, and race of the person stopped are included.

I wrote some R code to clean and compile the data into a single .RData file. The code and clean data set are available in my Github repository.

Here are some preliminary descriptive statistics:

StopAndFrisk

Age Distribution of Stopped Persons

The data shows some interesting trends (a sketch of how such summaries could be computed follows the list):

  • Stops had been increasing steadily from 2003 to 2012, but falling since 2012.
  • The percentage of stopped persons who were black was consistently 3.5-6.5 times higher than the percentage of stopped persons who were white.
  • The data indicates whether or not officers explained the reason for the stop to the stopped person. The data shows that police gave an explanation about 98-99% of the time. Of course, this involves a certain level of trust since the data itself is recorded by police. There is no difference in this statistic across race and sex.
  • The median age of stopped persons was 24. The distribution was roughly the same across race and sex.
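
As an illustration, summaries like these could be computed with a few lines of dplyr, assuming the compiled data frame is called sqf and contains columns named year, race and age (these names are assumptions; the actual column names in the .RData file may differ):

library(dplyr)

# Share of stops by race within each year (column names are assumptions)
sqf %>%
  group_by(year, race) %>%
  summarise(stops = n()) %>%
  mutate(share = stops / sum(stops))

# Median age of stopped persons
median(sqf$age, na.rm = TRUE)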

A few notes on the data:

  • The raw data is saved as CSV files, one file for each year. However, the same variables are not tracked in each year. The .RData file on Github only contains select variables.
  • The importing and cleaning codes can take about 15 minutes to run.
  • All stops in all years have coordinates marking the location of the stop, however I’m still unable to make sense of them. I plan to publish another post with some spatial analyses.

The coding for this was particularly interesting because I had never used R to download ZIP files from the web. I reproduced this portion of the code below. It produces one dataset for each year from 2013 to 2014.

for(i in 2013:2014){
 temp <- tempfile()
 url<-paste("http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/",i,"_sqf_csv.zip",sep='')
 download.file(url,temp)
 assign(paste("d",i,sep=''),read.csv(unz(temp, paste(i,".csv",sep=''))))
}
unlink(temp)

To leave a comment for the author, please follow the link and comment on his blog: Stable Markets » R.


New Shiny cheat sheet and video tutorial


(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’ve added two new tools that make it even easier to learn Shiny.

Video tutorial

01-How-to-start

The How to Start with Shiny training video provides a new way to teach yourself Shiny. The video covers everything you need to know to build your own Shiny apps. You’ll learn:

  • The architecture of a Shiny app
  • A template for making apps quickly
  • The basics of building Shiny apps
  • How to add sliders, drop down menus, buttons, and more to your apps
  • How to share Shiny apps
  • How to control reactions in your apps to
    • update displays
    • trigger code
    • reduce computation
    • delay reactions
  • How to add design elements to your apps
  • How to customize the layout of an app
  • How to style your apps with CSS

Altogether, the video contains two hours and 25 minutes of material organized around a navigable table of contents.

Best of all, the video tutorial is completely free. The video is the result of our recent How to Start Shiny webinar series. Thank you to everyone who attended and made the series a success!

Watch the new video tutorial here.

New cheat sheet

The new Shiny cheat sheet provides an up-to-date reference to the most important Shiny functions.

shiny-cheatsheet

The cheat sheet replaces the previous cheat sheet, adding new sections on single-file apps, reactivity, CSS and more. The new sheet also gave us a chance to apply some of the things we’ve learned about making cheat sheets since the original Shiny cheat sheet came out.
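
If you have never seen a single-file app, this is roughly what one looks like: the UI and the server function live together in one app.R file. The snippet below is a minimal sketch for illustration, not an excerpt from the cheat sheet:

library(shiny)

# User interface: one slider and one plot
ui <- fluidPage(
  sliderInput("n", "Number of points:", min = 10, max = 100, value = 50),
  plotOutput("scatter")
)

# Server logic: redraw the plot whenever the slider changes
server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(runif(input$n), runif(input$n))
  })
}

# Running this file (or calling shinyApp directly) launches the app
shinyApp(ui = ui, server = server)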

Get the new Shiny cheat sheet here.

To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog.


Stop and Frisk: Spatial Analysis of Racial Discrepancies


(This article was first published on Stable Markets » R, and kindly contributed to R-bloggers)
Stops in 2014. Red lines indicate high white stop density areas and blue shades indicate high black stop density areas.
Notice that high white stop density areas are very different from high black stop density areas.
The star in Brooklyn marks the location of officers Liu’s and Ramos’ deaths. The star on Staten Island marks the location of Eric Garner’s death.

In my last post, I compiled and cleaned publicly available data on over 4.5 million stops over the past 11 years.

I also presented preliminary summary statistics showing that blacks had been consistently stopped 3-6 times more than whites over the last decade in NYC.

Since the last post, I managed to clean and reformat the coordinates marking the location of the stops. While I compiled data from 2003-2014, coordinates were available for year 2004 and years 2007-2014. All the code can be found in my GitHub repository.

My goals were to:

  • See if blacks and whites were being stopped at the same locations
  • Identify areas with especially high amounts of stops and see how these areas changed over time.

Killing two birds with one stone, I made density plots to identify areas with high and low stop densities. Snapshots were taken in 2 year intervals from 2007-2013. Stops of whites are indicated in red contour lines and stops of blacks are indicated in blue shades.
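
As a rough illustration of how such density plots can be built, here is a minimal ggplot2 sketch; the data frame stops and its columns lon, lat and race are assumptions for the example, not the actual variable names in the compiled data set:

library(ggplot2)

# Blue shaded 2D density for stops of blacks,
# red contour lines for stops of whites (column names are assumed)
ggplot() +
  stat_density2d(data = subset(stops, race == "BLACK"),
                 aes(x = lon, y = lat, fill = ..level..),
                 geom = "polygon", alpha = 0.4) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  stat_density2d(data = subset(stops, race == "WHITE"),
                 aes(x = lon, y = lat), colour = "red") +
  coord_fixed()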

StopFriskSpatial

There are a few things to note:

  • The snapshots indicate that, in those years, blacks and whites were stopped at very different locations. Whites were being stopped predominantly in Staten Island, Brooklyn, and Manhattan. There is very little overlap with high black stop density areas.
  • Blacks were stopped predominantly around the Brooklyn/Queens border and Manhattan/Bronx border.
  • These spatial discrepancies are consistent over the time given.
  • The high density areas are getting larger over time as the total number of stops decline (indicated by the range of the map legends).

Here is the map of stops in 2014, the last year for which I have data:

Stops in 2014. Red lines indicate high white stop density areas and blue shades indicate high black stop density areas.
Notice that high white stop density areas are very different from high black stop density areas.
The star in Brooklyn marks the location of officers Liu’s and Ramos’ deaths. The star on Staten Island marks the location of Eric Garner’s death.

In 2014, we see more concentrated stops of blacks along the coast of Staten island. In fact, Eric Garner died in precisely one of these high-density areas. The location of his death is marked with a star.

Similarly, Officers Liu and Ramos also died in a high black stop density area (location marked with the star in Brooklyn).

Importance. It’s easy to see the importance of such spatial analyses. They add several layers of information on top of the basic summary statistics I presented in my previous post. As I’ve shown above, very terrible and unfortunate events can happen in high-density areas.

Simultaneity. Let’s say we overlay this stop and frisk data with perfectly measured crime data (the potential mismeasurement of “crime” is discussed below) and find that high black density areas actually have low crime density. We cannot necessarily conclude that the NYPD is engaging in a racist expansion of stops in black areas, despite low crime rates. What if crime rates are low because of the high amounts of stops? With the current data, it’s hard to say which way the causality would run.

Unobserved Factors. Simultaneity aside, we also have unobserved factors to contend with. Are the spatial discrepancies visualized above due to racist police segmenting geographically to efficiently target blacks? Or are the spatial discrepancies simply due to the fact that blacks and whites, in general, live and/or hang out in very different places? Without additional data, it’s hard to say.

Difficulty Establishing Simple Claims. Even the relatively simple claim of “blacks commit crimes at higher than average rates” is difficult to establish. When most people speak of “crime rates”, they are actually referring to arrest rates. We usually don’t observe crimes because criminals aren’t generally upfront people who self-report their crimes. So, we use police arrest data as a proxy for crime. However, if we think that police are inherently racist, then the arrest data they record would also be biased upward. Arrest rates could be much higher than crime rates. My point is that even establishing simple claims requires great care (both in how we phrase the claims and how we attempt to answer them) and is often difficult.

Racism. As I said above, issues such as simultaneity and unobserved factors make it very difficult to establish even simple relationships or claims. It is even harder to establish the inherent racism of an entire group of people, or the inherent criminality of an entire group of people. Much more information is needed.

I hope that making this data available and clean for public use will help researchers address some of these difficulties. Again, all of my code and datasets are available on GitHub. My hope is that other people will combine this data with their own data to reach more impactful conclusions. As always, please cite when sharing.

To leave a comment for the author, please follow the link and comment on his blog: Stable Markets » R.


Working with the RStudio CRAN logs


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

 by Joseph Rickert

The installr package has some really nice functions for working with the daily package download logs for the RStudio CRAN mirror which RStudio graciously makes available at http://cran-logs.rstudio.com/. The following code uses the download_RStudio_CRAN_data() function to download a month's worth of .gz compressed daily log files into the test3 directory and then uses the function read_RStudio_CRAN_data() to read all of these files into a data frame. (The portion of the status output provided shows the files being read in one at a time.) Next, the function most_downloaded_packages() calculates that the top six downloads for the month were: Rcpp, stringr, ggplot2, stringi, magrittr and plyr.

# CODE TO DOWNLOAD LOG FILES FROM RSTUDIO CRAN MIRROR
# FIND MOST DOWNLOADED PACKAGE AND PLOT DOWNLOADS
# FOR SELECTED PACKAGES
# -----------------------------------------------------------------
library(installr)
library(ggplot2)
library(data.table) #for downloading
 
# ----------------------------------------------------------------
# Read data from RStudio site
RStudio_CRAN_dir <- download_RStudio_CRAN_data(START = '2015-05-15',END = '2015-06-15', 
	                                           log_folder="C:/DATA/test3")
# read .gz compressed files form local directory
RStudio_CRAN_data <- read_RStudio_CRAN_data(RStudio_CRAN_dir)
 
#> RStudio_CRAN_data <- read_RStudio_CRAN_data(RStudio_CRAN_dir)
#Reading C:/DATA/test3/2015-05-15.csv.gz ...
#Reading C:/DATA/test3/2015-05-16.csv.gz ...
#Reading C:/DATA/test3/2015-05-17.csv.gz ...
#Reading C:/DATA/test3/2015-05-18.csv.gz ...
#Reading C:/DATA/test3/2015-05-19.csv.gz ...
#Reading C:/DATA/test3/2015-05-20.csv.gz ...
#Reading C:/DATA/test3/2015-05-21.csv.gz ...
#Reading C:/DATA/test3/2015-05-22.csv.gz ...
 
 
dim(RStudio_CRAN_data)
# [1] 8055660      10
 
# Find the most downloaded packages
pkg_list <- most_downloaded_packages(RStudio_CRAN_data)
pkg_list
 
#Rcpp  stringr  ggplot2  stringi magrittr     plyr 
  #125529   115282   103921   103727   102083    97183
 
lineplot_package_downloads(names(pkg_list),RStudio_CRAN_data)
 
# Look at plots for some packages
barplot_package_users_per_day("checkpoint",RStudio_CRAN_data)
#$total_installations
#[1] 359
barplot_package_users_per_day("Rcpp", RStudio_CRAN_data)
#$total_installations
#[1] 23832

The function lineplot_package_downloads() produces a multiple time series plot for the top five packages:

Time_series

and the barplot_package_users_per_day() function provides download plots. Here we contrast downloads for the Revolution Analytics' checkpoint package and Rcpp.    

                                                 

   
CheckpointRcpp

Downloads for the checkpoint package look pretty uniform over the month. checkpoint is a relatively new, specialized package for dealing with reproducibility issues. The download pattern probably represents users discovering it. Rcpp, on the other hand, is essential to an incredible number of other R packages. The right skewed plot most likely represents the tail end of the download cycle that started after Rcpp was upgraded on 5/1/15.

All of this works well for small amounts of data. However, the fact that read_RStudio_CRAN_data() puts everything in a data frame presents a bit of a problem for working with longer time periods with the 6GB of RAM on my laptop. So, after downloading the files representing the period (5/28/14 to 5/28/15) to my laptop,

# Convert .gz compressed files to .csv files
in_names <- list.files("C:/DATA/RStudio_logs_1yr_gz", pattern="*.csv.gz", full.names=TRUE)
out_names <- sapply(strsplit(in_names,".g",fixed = TRUE),"[[",1)
 
length(in_names)
for(i in 1:length(in_names)){
	df <- read.csv(in_names[i])
	write.csv(df, out_names[i],row.names=FALSE)
}

I used the external memory algorithms in Revolution R Enterprise to work with the data on disk. First, rxImport() brings all of the .csv files into a single .xdf file and stores it on my laptop. (Note that the rxGetInfo() output below indicates that the file has over 90 million rows.) Then, the super efficient rxCube() function is used to tabulate the package counts.

# REVOSCALE R CODE TO IMPORT A YEARS WORTH OF DATA
data_dir <- "C:/DATA/RStudio_logs_1yr"
in_names <- list.files(data_dir, pattern="*.csv.gz", full.names=TRUE)
out_names <- sapply(strsplit(in_names,".g",fixed = TRUE),"[[",1)
 
#----------------------------------------------------
# Import to .xdf file
# Establish the column classes for the variables
colInfo <- list(
	       list(name = "date", type = "character"),
	       list(name = "time", type = "character"),
	       list(name = "size", type = "integer"),
	       list(name  = "r_version", type = "factor"), 
	       list(name = "r_arch", type = "factor"), 
	       list(name = "r_os", type = "factor"),
	       list(name = "package", type = "factor"),
	       list(name = "version", type = "factor"),
	       list(name = "country", type = "factor"),
	       list(name = "1p_1d", type = "integer"))
 
num_files <- length(out_names)
out_file <- file.path(data_dir,"RStudio_logs_1yr")
 
append = FALSE
for(i in 1:num_files){
rxImport(inData = out_names[i], outFile = out_file,     
		 colInfo = colInfo, append = append, overwrite=TRUE)
       	 append = TRUE
}	
# Look at a summary of the imported data
rxGetInfo(out_file)
#File name: C:/DATA/RStudio_logs_1yr/RStudio_logs_1yr.xdf
#Number of observations: 90200221
#Number of variables: 10
# Long form tabulation
cube1 <- rxCube(~ package, data = out_file)
# Computation time: 5.907 seconds.
cube1 <- as.data.frame(cube1)
sort1 <- rxSort(cube1, decreasing = TRUE, sortByVars = "Counts")
# Time to sort data file: 0.078 seconds
write.csv(head(sort1,100),"Top_100_Packages.csv")

Here are the download counts for top 100 packages for the period (5/28/14 to 5/28/15).

Top_100

You can download this data here: Download Top_100_Packages

 

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.
