Lesson 2: Data Understanding
What are the components of a dataset?
A dataset is a collection of data made up of features. Each feature represents one attribute of the data that can be used in data analysis; for example, a dataset of songs might have the features Genre, Artist, and Song Name.
In Microsoft Excel, for example, features are represented by columns in a sheet. An Excel sheet is essentially a data frame, that is, a 2D representation of data.
Features have a datatype. Datatypes include:
- Numeric
- Binary
- Ordinal
- Interval
- Categorical
- Textual
The first row (the header) in a data frame explains what the columns mean.
For example, the data frame below displays school students and the subjects they are taking. It has headers of Name, Age, Year, Computer Science, and History.
Name is textual, Age and Year are numeric, and the subject columns, Computer Science and History, are binary (denoting whether the student is taking the subject).
Name | Age | Year | Computer Science | History |
---|---|---|---|---|
Mary | 17 | 6 | TRUE | FALSE |
John | 18 | 6 | TRUE | TRUE |
Peter | 16 | 5 | TRUE | FALSE |
Jane | 16 | 5 | FALSE | TRUE |
This format makes it easy to find the information you want (for example, which 6th years are studying Computer Science?).
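As a minimal sketch of that kind of lookup, the table can be written in plain Python as a list of rows, with each column header used as a key. The helper name `names_where` is hypothetical and only for illustration; a data frame library would offer its own way of filtering rows.

```python
# The table above as a list of rows; each dict key is a column header (feature).
students = [
    {"Name": "Mary", "Age": 17, "Year": 6, "Computer Science": True, "History": False},
    {"Name": "John", "Age": 18, "Year": 6, "Computer Science": True, "History": True},
    {"Name": "Peter", "Age": 16, "Year": 5, "Computer Science": True, "History": False},
    {"Name": "Jane", "Age": 16, "Year": 5, "Computer Science": False, "History": True},
]


def names_where(rows, year, subject):
    """Return the names of students in the given year who take the subject.

    Hypothetical helper: it relies on the subject columns being binary (True/False).
    """
    return [row["Name"] for row in rows if row["Year"] == year and row[subject]]


# Which 6th years are studying Computer Science?
sixth_year_cs = names_where(students, 6, "Computer Science")
print(sixth_year_cs)  # ['Mary', 'John']
```

Because the subject features are binary, the filter can test them directly without comparing against a value.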
Mean, Mode, and Median
There are three main methods that are used in statistics to get an average. These are the mean, the mode, and the median.
Mean
The mean is the arithmetic average of the values in a dataset. In the example above, if we take the feature “Age”, the mean age is found by adding up all the ages and dividing by the number of ages, which in this instance is 4: 17 + 18 + 16 + 16 = 67. Dividing the sum of the ages (67) by the number of pupils (4) gives us a mean age of 16.75. The mean, or average, age of this group is therefore 16.75.
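The same calculation can be sketched in Python. The list name `ages` is an assumption made for illustration; the values are the four ages from the table.

```python
from statistics import mean

# Ages of Mary, John, Peter, and Jane from the table above.
ages = [17, 18, 16, 16]

total = sum(ages)             # 17 + 18 + 16 + 16 = 67
mean_age = total / len(ages)  # 67 / 4

print(total)     # 67
print(mean_age)  # 16.75

# The standard library gives the same result.
print(mean(ages))  # 16.75
```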
Mode
The mode is the most common, or most frequent, value in a dataset. In the same example, if we take the four ages above (17, 18, 16, 16), the mode of this set is 16, as it is the most common age.
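One way to check this in Python is to count how often each age occurs and pick the most frequent one. This is a sketch using the standard library; the variable names are chosen for illustration.

```python
from collections import Counter
from statistics import mode

# Ages of Mary, John, Peter, and Jane from the table above.
ages = [17, 18, 16, 16]

# Count how many times each age appears.
counts = Counter(ages)
print(counts[16], counts[17], counts[18])  # 2 1 1

# The mode is the value with the highest count.
most_common_age, frequency = counts.most_common(1)[0]
print(most_common_age)  # 16

# The standard library gives the same result.
print(mode(ages))  # 16
```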
Median
The median is the middle value in a sorted set of numbers. Take our feature “Age” again, but add a new entry for another student who is 18. That gives us the set 17, 18, 16, 16, 18. To find the median, we first need to sort these numbers, which gives us 16, 16, 17, 18, 18. In this example, the median age is 17, as it is the middle value. If the set has an even number of values, we add the two middle values together and divide the sum by two to get the median.
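The steps above (sort, then take the middle value or the average of the two middle values) can be sketched as follows. The function name `median_of` is hypothetical; the standard library's `statistics.median` does the same job.

```python
from statistics import median


def median_of(values):
    """Return the median: the middle value of the sorted data.

    For an even number of values, average the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


# The four ages from the table plus the new 18-year-old student.
ages = [17, 18, 16, 16, 18]
print(sorted(ages))     # [16, 16, 17, 18, 18]
print(median_of(ages))  # 17

# Even number of values: the two middle values are 16 and 17.
print(median_of([17, 18, 16, 16]))  # 16.5

# The standard library agrees.
print(median(ages))  # 17
```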