data shipped with R
loading data
You can simply use data that is natively provided by R. A famous example is the Titanic data set. Load it with the data() command, for example data("Titanic").
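A minimal sketch of the loading step, assuming a standard R installation in which the datasets package is available:

```r
# Load the built-in Titanic data set into the workspace.
data("Titanic")

# List the objects in the workspace; "Titanic" should now be among them.
ls()
```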
Not much appears to happen, because we are not using an IDE. An IDE like RStudio would actually display the variables. This is a trade-off we accept for the convenience of not having to set up R, but of course there are ways around it. Below are a couple of ways to display data:
- the print() command is the most verbose way to output data
- the head() command prints only the first couple of lines (the number can be specified using head(object, n = X))
- the glimpse() command is very convenient, but needs the dplyr package to be loaded or referenced (with ::, as in dplyr::glimpse())
- the str() command gives the structure of the object, so it provides a meta view of the variable. This is convenient if you get some unexpected output.
- the class() command returns the class of the object
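The commands listed above can be tried on the Titanic object as follows. This is only a sketch: the glimpse() call is guarded because it assumes the dplyr package is installed, which may not be the case on every system.

```r
data("Titanic")

print(Titanic)         # full output of the object
head(Titanic, n = 3)   # shortened view of the object
str(Titanic)           # structure: dimensions, dimension names, storage type
class(Titanic)         # class of the object

# glimpse() lives in dplyr; reference it with :: if the package is installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(Titanic)
}
```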
inspecting data
OK, so now we know something about the data, but what does it all mean? Luckily, built-in datasets usually come with pretty good documentation, which we can call up using the ? operator, for example ?Titanic.
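As a short sketch, either of the following lines brings up the help page for the data set. How the page is displayed (pager, browser, or IDE pane) depends on the environment.

```r
# Shorthand form: the ? operator followed by the object name.
?Titanic

# Equivalent long form using the help() function.
help("Titanic")
```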
conversion to data frame
You can see that the data is stored in a table. This format has been around for a while and is perfectly fine to work with; however, it is often more convenient to use a data frame.
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
- The column names should be non-empty.
- The row names should be unique.
- The data stored in a data frame can be of numeric, factor or character type, among others (logical, for instance).
- Each column should contain the same number of data items.
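The properties above can be checked on a small hand-made data frame. The object name people and all values in it are invented purely for illustration.

```r
# Toy data frame with one character, one numeric and one factor column.
people <- data.frame(
  name   = c("Ann", "Bob", "Cy"),
  age    = c(31, 45, 27),
  ticket = factor(c("1st", "3rd", "3rd")),
  stringsAsFactors = FALSE  # keep 'name' as character in older R versions too
)

names(people)                       # column names, all non-empty
anyDuplicated(rownames(people))     # 0 means the row names are unique
sapply(people, class)               # type of each column
sapply(people, length)              # every column has the same number of items
```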
Data frames should also be tidy (Wickham 2014). But what is a tidy data frame? There is a long philosophical debate that is interesting to follow but holds little practical value. The ground rule for tidy data is:
Each column is a variable, each observation is a row.
This means that data is sometimes repeated for the sake of clarity. To anyone new to this concept, this may sound over the top at first, but once you get used to it, you will never want to look at data any other way.
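To make the repetition concrete, here is a sketch that derives a small wide table from the Titanic data and then reshapes it into the tidy layout. The names wide_counts and tidy_counts are placeholders chosen for this example.

```r
data("Titanic")

# Wide layout: one row per class, one column per sex.
wide_counts <- margin.table(Titanic, c(1, 2))
wide_counts

# Tidy layout: one row per combination of class and sex.
# Each class label is now repeated, once for every sex.
tidy_counts <- as.data.frame(wide_counts)
tidy_counts
```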
This is also where the name tidyverse comes from. Once you stick to that fundamental design philosophy, the functions in the tidyverse are very simple to use.
Back to the conversion, so … how?
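One common way to do the conversion is the base R function as.data.frame(). The sketch below assumes that route; the name titanic_df is chosen only for this example.

```r
data("Titanic")

# Convert the four-dimensional table into a data frame.
# Every combination of the four variables becomes one row,
# and the counts end up in a column called Freq.
titanic_df <- as.data.frame(Titanic)

head(titanic_df)
```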
You can see above the principle of tidy data.
Now, the glimpse() command also gives a different result.
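A sketch of the comparison, assuming dplyr is installed (str() is used as a fallback otherwise) and again using titanic_df as an illustrative name:

```r
data("Titanic")
titanic_df <- as.data.frame(Titanic)

if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(Titanic)      # the original table
  dplyr::glimpse(titanic_df)   # the data frame: one line per column
} else {
  str(Titanic)
  str(titanic_df)
}
```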
You can also play around with the summary() function on the Titanic table or the Titanic data frame.
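As a sketch, the two calls look like this. On the table, summary() reports the number of cases and a chi-squared test of independence of the factors; on the data frame, it summarises each column separately. The name titanic_df is again only illustrative.

```r
data("Titanic")
titanic_df <- as.data.frame(Titanic)

# Summary of the contingency table.
table_summary <- summary(Titanic)
table_summary

# Column-wise summary of the data frame.
df_summary <- summary(titanic_df)
df_summary
```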