In every formal study and research, a presentation of Analysis is necessary. It is to give weight and justification on case that is being proposed and presented, without the analysis it cannot be properly called a study and will only classified as an article. In the same context, the data cleaning or how is the data presented and utilized must be discussed and defended in order to make the proposal, research or study be believable.
There are many types of study and research, and most (if not all) of them requires data gathering, a survey or an experiment to a specimen.
Commonly, data cleaning comes after the fact. Or after the data are gathered on subjects. This is the most common type or the Reactive Data Cleaning. There is another type of Data Cleaning and that is the Proactive Data Cleaning.
The Proactive Data Cleaning involves planning of the Data Gathering, crafting a threshold, setting up conditional tests and properly defining the tests, questionnaires and forms. The keyword is planning.
And here we are -the Basic Idea and Fundamentals of Data Cleaning. It requires the researchers to have at least an intermediate knowledge on using MS. Excel.
1 The Building Blocks of Data Cleaning
Variables and respondents (or specimens) are prepared and encoded usually in MS Excel.
1. Columns – by standard it contains the information related to one particular variable or question
2. Rows – by standard it contains the unit (or the sample). If the data came from a survey, row signifies the particular respondent.
3. A row must have a (UID) unique identifier (like a student number, serial number etc.) to allow easier identification and processing of the data.
4. Cell – is an intersection of the row and column. This represents a response on a specific question by a specimen.
2 Sanity Checks
The sanity checking of the gathered data is best done in tabular format of MS Excel. Below things are to watch out and look into;
• The number of rows must be equal to the number of respondents. Checking for duplicate entries must be carried out to ensure that all respondents are unique.
• The number of columns must be the same as the number of questions (or test in a sample). It can be done by checking data types in the column (eg. Numeric, text, etc.).
• Time stamp – the time format should be uniform in the table to avoid any misrepresentation.
• Geo-coded variables – it is necessary when a part of the study includes the actual location. This is usually tagged by Longitude and Latitude to pinpoint the location.
o To ensure that the coordinates is correct or the respondent is in the location covered by the study. Upper and Lower limit of the long/lat can be defined by the researcher, then use the “IF” & “AND” function of MS Excel.
3 The Different Question Types
This is basically finding responses or answers that are not within the defined limits or predetermined response.(e.g., Response should be ordinal –1st to 5th, ratio, whole number, character length etc.) Each question should have a definite character type / response type from the questionnaire. By anticipating this, data cleaning will be easier to do. However, this should be noted that it is only done to make the processing faster without diluting the results.
In excel, inconsistencies are detected by Data Validation and by Sorting and Filtering.
4 The Missing Data
Planning/ Data Gathering
Questionnaire can also be improved by introducing the independent and dependent variable (predictor and response). The dependent variable establishes the pattern while the independent variable shed light on the behavior
Missing Data on Post-survey
There are two common ways to deal with the missing data – the first is by deleting it, if there is no sense keeping it and there is no significant data that can be extracted from the rest of the variables.The other one is the by imputation by means (any) of mode, mean and median – which is applicable to numeric data. This can be done and processed in excel.
However, the researcher should limit on how much they can use this option, as it may dilute the data. The common acceptable limit is not to use it in 3 consecutive stages.
There is one more that is rarely used, but is applicable Geo location data. It is known as the Last Observation Carried Forward. It is basically guessing but with consideration of its previous data. If further study is to be used, then Mode and Median will be applicable to this.
5 The Outliers
The Outliers are data or variable that is abnormal or significantly different. It can be correct or is wrong and therefore can make the plot skew significantly.
To find the outliers in the data set, the table can be visualized thru graphs in excel, commonly used are; histogram scatterplot. It can also be checked by using “sort” in excel.
If upon validation (or resurvey), it is proven that the value is correct, then keep it. This is an interesting findings on the study. However, if it is incorrect then there is no reason to keep it on the data table.
6 Conditional Questions and Tackling Them
Sometimes a conditional question must be incorporated to the study, and because of it. The data cleaning might be a bit complex.
To clean the variables in conditional question, the flow chart must be first clearly understood. Once it is clear then the “IF” and “Else” function of excel will be very handy to find any errors on the data.
7 Joining, Delimiting and Concatenating
Sometimes, data or variables must be pre-processed in order to give more insights. Variables are sometimes- joined, delimited and concatenated. All of this can be done in excel.
Delimiting is when the cells (can be a range, column or row) containing similarly formatted information are separated. (eg original format of 11292020T01300 is delimited to form a two separate cell and becomes 11292020, T01300.
Concatenating is the opposite of the delimiting. It combines the contents of other cells into one cell.
Reference:The Ultimate Guide to Basic Data Cleaning - Atlan
<<<____________________________________________________________>>>
If you want to know more, have time to watch this video made by Data Engineering.
Comments
Post a Comment
Keep your comment considerate and respectful.