Cleaning Data in SQL

    Data from your queries may often contain missing values, duplicate rows, or data in the wrong format. It is crucial to be able to clean the data you retrieve through your queries. This template runs through the basics of handling missing data, duplicate rows, and string cleaning.

    This template connects to a database containing different course datasets by default. However, you can use any integration you want, so long as you have access to a table with text data. You can learn more about integrations here.

    💡  The data we will use is contained in various schemas of the Course Databases database. To edit the queries, you must add the schema name when referencing tables in this database (e.g., world.currencies).

    Missing values

    Missing or null values are common in data and can often present a problem for future analyses. You can inspect the number of missing values in a column by filtering the data for null values and performing an aggregation.

    In the example below, we filter for records where the unemployment_rate is NULL and then COUNT() the number of rows.

    SELECT COUNT(*) AS number_missing_unemployment_rates
    FROM world.economies
    WHERE unemployment_rate IS NULL
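    Because aggregate functions ignore NULL values, you can also count the missing values in several columns at once without a filter. The sketch below assumes world.economies also has a gdp_percapita column; swap in whichever columns your table actually contains.

    ```sql
    -- COUNT(column) counts only non-null values, so subtracting it from
    -- COUNT(*) yields the number of NULLs in that column.
    SELECT
        COUNT(*) - COUNT(unemployment_rate) AS missing_unemployment_rates,
        COUNT(*) - COUNT(gdp_percapita)     AS missing_gdp_percapita
    FROM world.economies
    ```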

    Filling missing values

    Once you have identified the missing values, you can handle them in several ways.

    Here, we use COALESCE() to replace NULL values with the average unemployment rate, which we compute via a subquery. COALESCE() returns the first of its arguments that is not null: if unemployment_rate has a value, it is returned as-is; if it is NULL, the average from the subquery is used instead.

    SELECT
        code,
        unemployment_rate,
        COALESCE(unemployment_rate,
                 (SELECT AVG(unemployment_rate) FROM world.economies)) AS filled_unemployment_rate
    FROM world.economies
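    Instead of a single global average, you can make the fill value context-sensitive with a window function. The sketch below assumes world.economies also has a year column and fills each missing rate with the average for that year:

    ```sql
    SELECT
        code,
        year,
        unemployment_rate,
        -- AVG(...) OVER (PARTITION BY year) computes the average within
        -- each year, skipping NULLs like any aggregate.
        COALESCE(unemployment_rate,
                 AVG(unemployment_rate) OVER (PARTITION BY year)) AS filled_unemployment_rate
    FROM world.economies
    ```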

    Duplicate rows

    Another data issue you may come across is duplicate rows. To identify them, you can use ROW_NUMBER() to number the rows within each group of identical values. The columns you list in the PARTITION BY clause of the window function define which combination of values counts as a duplicate.

    In the example below, we use PARTITION BY to assign row numbers based on the combination of country code and unemployment rate. As you can see from the query results, duplicate rows receive a row_number of 2 or greater.

    SELECT *
    FROM (
        SELECT 
            code, 
            unemployment_rate,
            ROW_NUMBER() OVER(PARTITION BY code, unemployment_rate) AS row_number
        FROM world.economies
    ) AS sub
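    As an alternative sketch, GROUP BY with a HAVING clause lists each duplicated combination once, along with how often it occurs:

    ```sql
    -- HAVING filters groups after aggregation, so only combinations
    -- that appear more than once survive.
    SELECT code, unemployment_rate, COUNT(*) AS occurrences
    FROM world.economies
    GROUP BY code, unemployment_rate
    HAVING COUNT(*) > 1
    ```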

    Discarding duplicate rows

    Removing duplicate rows is just as simple as identifying them: add a filter to the outer query that keeps only the rows where row_number equals 1.

    SELECT *
    FROM (
        SELECT 
            code, 
            unemployment_rate,
            ROW_NUMBER() OVER(PARTITION BY code, unemployment_rate) AS row_number
        FROM world.economies
    ) AS sub
    WHERE row_number = 1
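    If you want to remove the duplicates from the table itself rather than filter them out in a query, one sketch uses PostgreSQL's system column ctid (which identifies a row's physical location) to keep a single row per combination. This permanently modifies the data, so test it on a copy first.

    ```sql
    -- Self-join the table and delete the physically later copy of each
    -- duplicate pair. IS NOT DISTINCT FROM treats two NULLs as equal,
    -- so rows with a missing unemployment_rate are deduplicated too.
    DELETE FROM world.economies a
    USING world.economies b
    WHERE a.ctid > b.ctid
      AND a.code = b.code
      AND a.unemployment_rate IS NOT DISTINCT FROM b.unemployment_rate;
    ```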

    Invalid data

    If you are aware of invalid data, you can find (and remove) it using pattern-matching.

    In the example below, we search for rows where the indep_year column contains a negative value. To do so, we cast the column to text using ::TEXT (PostgreSQL syntax) and then filter with LIKE. The pattern searches for a minus sign (-) followed by any other characters (using the % wildcard).

    SELECT indep_year
    FROM world.countries
    WHERE indep_year::TEXT LIKE '-%'
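    Once identified, the invalid rows can be excluded by negating the same pattern. Note that NOT LIKE evaluates to NULL for NULL inputs, so rows with a missing indep_year are dropped here as well:

    ```sql
    -- Keep only rows whose indep_year does not start with a minus sign.
    SELECT name, indep_year
    FROM world.countries
    WHERE indep_year::TEXT NOT LIKE '-%'
    ```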

    You can also use pattern matching to find rows with similar variants. In the example below, we use a pattern to identify all rows with Monarchy in the gov_form column.

    Our pattern searches for any row containing "Monarchy". The % wildcards on either side match any characters (including none), so variants such as "Constitutional Monarchy" are also caught.

    SELECT DISTINCT name, gov_form
    FROM world.countries
    WHERE gov_form LIKE '%Monarchy%'
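    LIKE is case-sensitive, so a lowercase "monarchy" would slip through the pattern above. In PostgreSQL, you can use ILIKE, a case-insensitive variant of LIKE, to catch such rows as well:

    ```sql
    -- ILIKE matches regardless of letter case (PostgreSQL-specific).
    SELECT DISTINCT name, gov_form
    FROM world.countries
    WHERE gov_form ILIKE '%monarchy%'
    ```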

    Fixing invalid data

    There are a variety of ways to fix invalid data. One is to use a CASE expression to recategorize it. In the example below, we map every gov_form value that contains "Monarchy" (e.g., "Constitutional Monarchy") to simply "Monarchy". The remaining entries are left as they are.

    SELECT DISTINCT
        name,
        gov_form,
        CASE WHEN gov_form LIKE '%Monarchy%' THEN 'Monarchy'
             ELSE gov_form END AS fixed_gov_form
    FROM world.countries
    WHERE gov_form LIKE '%Monarchy%'
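    To apply the fix to the table itself rather than only in the query output, the same pattern can drive an UPDATE. This permanently overwrites the original values, so consider keeping the raw column or a backup.

    ```sql
    -- Collapse all variants containing "Monarchy" into a single category.
    UPDATE world.countries
    SET gov_form = 'Monarchy'
    WHERE gov_form LIKE '%Monarchy%';
    ```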