data shipped with R

loading data

You can simply use data that ships with R. A famous example is the Titanic data set. Load it using the data() command.
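A minimal sketch of the loading step, assuming a standard R installation in which the built-in datasets package is available:

```r
# Titanic ships with base R in the 'datasets' package.
data("Titanic")

# data() loads silently, so check that the object now exists.
exists("Titanic")
```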

Not much appears to happen, because we are not using an IDE. An IDE like RStudio would actually display the variables. This is the trade-off for the convenience of not having to set up R, but there are ways around it. Below are a couple of ways to display data:

  • the print command is the most verbose way to output data
  • the head command prints only the first couple of lines (the number can be specified using head(object, n = X))
  • the glimpse command is very convenient, but needs the dplyr package to be loaded or referenced with :: (as in dplyr::glimpse())
  • the str command gives the structure of the object, so it provides a meta view of the variable. This is convenient if you get unexpected output.
  • the class() command returns the class of the object
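The commands listed above can be tried on the Titanic object roughly as follows. This is a sketch, not necessarily the original code of this page; the check for an installed dplyr package is an assumption added so the example also runs where dplyr is missing.

```r
data("Titanic")

print(Titanic)        # full, verbose output
head(Titanic, n = 3)  # only the first part of the object
str(Titanic)          # structure: dimensions and dimension names
class(Titanic)        # the class of the object

# glimpse() lives in dplyr; call it with :: without attaching the package.
# The guard is an assumption for machines where dplyr is not installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(Titanic)
}
```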

inspecting data

OK, so now we know something about the data, but what does it all mean? Luckily, built-in datasets usually come with pretty good documentation, which we can open using ?.
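As a small sketch, the help page can be requested like this. help("Titanic") is the function form of ?Titanic; the extra call to dimnames() is an addition here, useful for checking the variable names against the documentation.

```r
data("Titanic")

# Same as typing ?Titanic at the console.
help("Titanic")

# The variables described in the help page and their levels.
dimnames(Titanic)
```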

conversion to data frame

You can see that the data is stored in a table. This format has been around for a while and is perfectly fine to work with; however, it is often more convenient to use a data frame.

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

  • The column names should be non-empty.
  • The row names should be unique.
  • The data stored in a data frame can be of numeric, factor or character type.
  • Each column should contain the same number of data items.

Data frames should also be tidy (Wickham 2014). But what is a tidy data frame? There is a long philosophical debate that is interesting to follow but holds little practical value. The ground rule for tidy data is:

Important

Each column is a variable, each observation is a row.

This means that data is sometimes repeated for the sake of clarity. To anyone new to the concept this sounds over the top at first, but once you get used to it, you will never want to look at data any other way.

This is also where the name tidyverse comes from. Once you stick to that fundamental design philosophy, the functions in the tidyverse are very simple to use.

Back to the conversion, so … how?
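One common way to do the conversion, sketched here with base R's as.data.frame(); the name titanic_df is a hypothetical choice for this example and the original page may have used a different function or name.

```r
data("Titanic")

# titanic_df is a hypothetical name chosen for this sketch.
titanic_df <- as.data.frame(Titanic)

head(titanic_df)  # first rows of the converted data
dim(titanic_df)   # number of rows and columns
```

The result has one row per combination of Class, Sex, Age and Survived, with the count stored in a Freq column, so category values repeat across rows.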

You can see above the principle of tidy data.

Now, the glimpse() command also gives a different result.
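A sketch of that call, assuming the data frame was created with as.data.frame() and stored under the hypothetical name titanic_df. The fallback to str() when dplyr is not installed is an assumption added for robustness.

```r
data("Titanic")
titanic_df <- as.data.frame(Titanic)  # hypothetical name

if (requireNamespace("dplyr", quietly = TRUE)) {
  # One line per column, with its type and first values.
  dplyr::glimpse(titanic_df)
} else {
  # Base R fallback giving similar information.
  str(titanic_df)
}
```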

You can also play around with the summary() function on the Titanic table or the Titanic data frame.
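For instance, a minimal comparison could look like this (titanic_df is again a hypothetical name for the converted data frame):

```r
data("Titanic")
titanic_df <- as.data.frame(Titanic)  # hypothetical name

# On the table: number of cases and a test for independence of the factors.
summary(Titanic)

# On the data frame: a per-column summary (level counts, quantiles of Freq).
summary(titanic_df)
```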

References

Wickham, Hadley. 2014. “Tidy Data.” The Journal of Statistical Software 59. http://www.jstatsoft.org/v59/i10/.