Skip to main content

Basics and Fundamentals of Data Cleaning

In every formal study and research, a presentation of Analysis is necessary. It is to give weight and justification on case that is being proposed and presented, without the analysis it cannot be properly called a study and will only classified as an article. In the same context, the data cleaning or how is the data presented and utilized must be discussed and defended in order to make the proposal, research or study be believable.


There are many types of study and research, and most (if not all) of them requires data gathering, a survey or an experiment to a specimen.


Commonly, data cleaning comes after the fact. Or after the data are gathered on subjects. This is the most common type or the Reactive Data Cleaning. There is another type of Data Cleaning and that is the Proactive Data Cleaning.


The Proactive Data Cleaning involves planning of the Data Gathering, crafting a threshold, setting up conditional tests and properly defining the tests, questionnaires and forms. The keyword is planning.


And here we are -the Basic Idea and Fundamentals of Data Cleaning. It requires the researchers to have at least an intermediate knowledge on using MS. Excel.

1    The Building Blocks of Data Cleaning

 

Lego - depicting unprocessed data

Variables and respondents (or specimens) are prepared and encoded usually in MS Excel.

1. Columns – by standard it contains the information related to one particular variable or question

2. Rows – by standard it contains the unit (or the sample). If the data came from a survey, row signifies the particular respondent. 

3. A row must have a (UID) unique identifier (like a student number, serial number etc.) to allow easier identification and processing of the data.

4. Cell – is an intersection of the row and column. This represents a response on a specific question by a specimen.

2    Sanity Checks

 

sanity checking


The sanity checking of the gathered data  is best done in tabular format of MS Excel. Below things  are to watch out and look into;

The number of rows must be equal to the number of respondents. Checking for duplicate entries must be carried out to ensure that all respondents are unique.

The number of columns must be the same as the number of questions (or test in a sample).  It can be done by checking data types in the column (eg. Numeric, text, etc.).

Time stamp – the time format should be uniform in the table to avoid any misrepresentation.

Geo-coded variables – it is necessary when a part of the study includes the actual location. This is usually tagged by Longitude and Latitude to pinpoint the location.

o To ensure that the coordinates is correct or the respondent is in the location covered by the study. Upper and Lower limit of the long/lat can be defined by the researcher, then use the “IF” & “AND” function of MS Excel.


3          The Different Question Types

 

question types

This is basically finding responses or answers that are not within the defined limits or predetermined response.(e.g., Response should be ordinal –1st to 5th, ratio, whole number, character length etc.) Each question should have a definite character type / response type from the questionnaire. By anticipating this, data cleaning will be easier to do. However, this should be noted that it is only done to make the processing faster without diluting the results.

In excel, inconsistencies are detected by Data Validation and by Sorting and Filtering.


4 The Missing Data

 

A missing piece in a puzzle affects all
 

Planning/ Data Gathering


The missing data can be avoided during the Data Gathering or survey by carefully planning the questionnaire or tests to specimen / respondents. During planning, the ambiguous choices are limited or eliminated. If the survey is to be done online, the forms are made with limited exit points and that the pages per forms must contain as much as possible the majority of the needed data.

Questionnaire can also be improved by introducing the independent and dependent variable (predictor and response). The dependent variable establishes the pattern while the independent variable shed light on the behavior

 

Missing Data on Post-survey


There are two common ways to deal with the missing data – the first is by deleting it, if there is no sense keeping it and there is no significant data that can be extracted from the rest of the variables.The other one is the by imputation by means (any) of mode, mean and median – which is applicable to numeric data. This can be done and processed in excel.

However, the researcher should limit on how much they can use this option, as it may dilute the data. The common acceptable limit is not to use it in 3 consecutive stages.

There is one more that is rarely used, but is applicable Geo location data. It is known as the Last Observation Carried Forward. It is basically guessing but with consideration of its previous data. If further study is to be used, then Mode and Median will be applicable to this.

5          The Outliers

the outlier - is something not ordinary on the set of data

The Outliers are data or variable that is abnormal or significantly different. It can be correct or is wrong and therefore can make the plot skew significantly.

To find the outliers in the data set, the table can be visualized thru graphs in excel, commonly used are; histogram scatterplot. It can also be checked by using “sort” in excel.

If upon validation (or resurvey), it is proven that the value is correct, then keep it. This is an interesting findings on the study. However, if it is incorrect then there is no reason to keep it on the data table.

6          Conditional Questions and Tackling Them

 

Flow chart

Sometimes a conditional question must be incorporated to the study, and because of it. The data cleaning might be a bit complex.

To clean the variables in conditional question, the flow chart must be first clearly understood. Once it is clear then the “IF” and “Else” function of excel will be very handy to find any errors on the data.

7          Joining, Delimiting and Concatenating

 Excel


Sometimes, data or variables must be pre-processed in order to give more insights. Variables are sometimes- joined, delimited and concatenated. All of this can be done in excel.

 Delimiting is when the cells (can be a range, column or row) containing similarly formatted information are separated. (eg original format of 11292020T01300 is delimited to form a two separate cell and becomes 11292020, T01300.

Delimiting

Concatenating is the opposite of the delimiting. It combines the contents of other cells into one cell.

Concatenating


                                                            Reference:The Ultimate Guide to Basic Data Cleaning - Atlan


<<<____________________________________________________________>>>


If you want to know more, have time to watch this video made by Data Engineering.



Comments

Popular posts from this blog

Social Awareness

 What is Social Awareness? How it can help and Improve an Individual and the Nation as a whole?  Social Awareness is an ability to know and understand the needs of other people - and do something about it. It is being sensitive and knowing what to do with it. If a person has a low social awareness, their attitude and behavior will appear to be selfish or self serving to the society in which they are involved. Getting the right response to the needs of other people improves social standing and success. It is a must attribute for a leader or manager. This is why, in a corporate world, the management puts an emphasis on their managers to be socially aware. This skill will bring a person a long way and of course their organization as well. It is needless to say that being socially aware is a de-facto core competency aside from the technical skills or managing skills. A harmonious society needs its people to be respectful towards its other, a pre-requisite of productive and pros

Are you Stressed or are you Pressured in your Job?

Stressed or Pressured? Which is which? In a corporate world, you might hear some sayings that being stressed is bad and being pressured is better than the latter. Why is that? But what do you really feel? According to the dictionary, stressed is defined as "experiencing mental or emotional strain or tension". While, Pressured  is described as a "compelling or constraining influence, such as persuasion or negative attitudes, on the mind or will." But if you are familiar to the engineering definition of both words, then at least you will have an idea why being pressured is better than being stressed. This is the reason why usually you will hear that it is better to be pressured. Here are the definition of Stress and Pressure according to principle of Physics. Stress - It is the intensity of the INTERNAL resisting force, developing at a point. It can be acting  normal (90 degrees), parallel to the surface or both. Pressure - It is the intensity of the EXTERNAL resistin

Free! Cloud Storage - 50GB

In this age and digital era, we always wants to keep a part of memory. This is always true if you are a learner that strives to keep more knowledge and information. And there are many platforms that offers online storage - or the Cloud Storage. These are the following and most common cloud providers; Google Drive (by Google, Inc.) One Drive (by Microsoft Corporation) Smart Phone Sponsored Cloud Storage (capacity and quality varies on brand) Dropbox Cloud computing and cloud storage are now the trend for sharing large files and even Company applications are now using this. In this time of pandemic, the Cloud solution has provided many business an alternative and saving costs in its maintenance. Cloud computing has proven its worth more - it is really the future. It transform people in sharing files and how people collaborate together. But one cloud provider has captured attention of the many netizens and business companies. This is because, cloud computing is their flagship, and so, the