Erno Pyykko/

Course Notes: Exploratory Data Analysis in SQL


Notes On SQL

On casting in SQL.

  • What's the difference with casting an 'integer' vs 'numeric'?
  • CAST() is a way to modify a data type within the executed query. It will not cascade the commands down to the database. Shorthand in SQL for CAST() is ::. I can see this being handy when building queries where I need to make sure data types match.
  • What's the calculation that runs behind the scenes for stddev() function?

On coalesce.

  • With COALESCE(col1, col2) I can fill missing values in col1 with col2's values.

Reminder for myself on subqueries, whether they're used in FROM or SELECT, they can only return one column. That means, I can compute information based on many columns, but the result can only be a single column.

New functions and revisting an old one.

  • trunc() will let me cut down the number of digits in the value, similar to round(), but it doesn't move the value up or down based on rounding rules.
  • generate_series() is similar to range() in python. I can use it to generate a range of ordered numbers in a column. It has the same arguments as well, from - to - step.
  • WITH [table_name] AS (SELECT * FROM table) did cause a bit of a struggle. WITH AS is a way to build a table that can be used to summarise data, without printing the results. This summary data can then be used in the core part of the query. How do I use this? (More complete example needed.)
  • corr() is a function that can be used to determine the relationship between two variables. Correlation is indicated by 0 to 1, where close to 1 indicates a stronger correlation and closer to 0 a weaker correlation between the two variables. It takes in two argumenst corr(arg1, arg2) where the arguments are the columns to be compared.
  • Median and Mean in SQL is a value that needs to be computed = percentile_disc() or percentile_cont() WITHIN GROUP (ORDER BY col) FROM table. There needs to be an operation to order the results. Important to remember that the percentile_disc() and precentile_cont() requires 0.5 as the argument to compute the mean or median. The number provided indicates to the algorithm what portion of the values to include in the calculation.
  • Temporary tables. Removing existing DROP TABLE IF EXIST x and CREATE TEMP TABLE [xyz] AS /n SELECT() FROM

On working with text data.

  • Text types are varchar and text.
  • Filtering text data with LIKE and ILIKE. Latter ignores the case of the string supplied inside '%x%'. Former matches the case. % is 0 or more characters, including spaces.
  • trim() can be used to remove characters or spaces from text data. prefixing trim() with r or l will remove characters from the beginning or end of the string. trim(col, '0123abcd') takes the column or value(s) as the first argument and the characters to be removed as a second argument.
  • lower() and upper() will set the case on the variable or column.
  • split_part(col/value, arg1, arg2) takes three arguments. It splits the provided col or value on the character arg1, using it as the delimiter, arg2 tells which part of the split string to return?
  • Tables can be updated with UPDATE [tablename] SET [col]=[value] WHERE [col]=[value]

On working with dates.

  • I've been using three different functions so far.
  • date_part() takes in two arguments, interval of the date an date column/value/field. In that order. date_part('day', date_col). It returns only that part of the date, scrapping the rest of the data. This is good for memory related reason?
  • date_trun() behaves the same way, except, it returns the requested interval and retains the information of the date outside of the interval.
  • EXTRACT(DOW FROM date_col), first argument tells the interval to extract from the date_col.

Mid assesment as a practice run: Q1: Return the rows that appears both in the movie_2000 and movie_2010 table. A: Intersect will return the rows from two tables with matches. Q2: Return the records if they have duplicates from the vendors table. A: Duplicate counting in SQL is done by using GROUP BY and COUNT() functions. Q3: From the vendors table, return the top 5 states with the largest number of vendors. A: Should have known this. Think more about the result vs what's already there. Here I only had to summarise data. Q4: From the fruit_2022 table, extract the month from the date column and convert the month to the NUMERIC data type. A: I used date_trunc() to extract the month, but it doesn't seem to work when casting to NUMERIC. In these situations I should use EXTRACT(MONTH FROM date_col)::NUMERIC Q5: From the movie_budget table, return the year with more than 3 movies in the table. A: Same as Q3, repeated mistake here. Q6: For each vendor_name in the vendors table, ensure that only the first letter of each word is upper case e.g. DATACAMP would become Datacamp. A: INITCAP() is a new function for me. Seems to do the same as .title() in Python. Q7: Add the rows from the movie_2010 table to movie_2000 but keep the duplicates. A: UNION ALL. What's the difference between UNION ALL and UNION? Q8: Return the number of duplicates as long as they are the same in the name and city column. A: Same as Q2. Reminder also to think about which columns need to be in the GROUP BY clause for the query not to result in an error. Q9: Add the rows from the movie_2010 table to movie_2000 but remove the duplicates. A: UNION. Here's my answer as well. UNION will not create duplicates, whereas UNION ALL will do so.

Summary of the mistakes.

  • Duplicate functions are 'manual' in SQL. Need to group the rows and count occurrences per category.
  • One new function to capitalise first letter of a word.
  • Data casting has some limitations with date_trunc vs. EXTRACT.
  • Merging tables, need to been in mind what INTERSECT vs. UNION vs. UNION ALL will return.

Skills assessment 'Data Management in SQL (PostgreSQL)' notes:

  • For combining string data between two columns in SQL use CONCAT(col1,col2,' add anyother content')
  • In order to get specifics out of a date use EXTRACT(month FROM date::DATE). That example takes the month from a date as the month number.
  • SQL method matching Python's .title() is INITCAP() maybe it will help memorizing it as 'initial capital'.
  • Once again EXTRACT() to the rescue. Getting a day as a number EXTRACT(DAY FROM Start_Time) :: Numeric
  • When doing self joins, there's no need to alias the tables with AS. I can just create additional tables by 'tablename a' and 'tablename b'.
  • I need to work on INNER, OUTER, LEFT, RIGHT joins and their use cases.
  • To extract digits from a string I can use LEFT(col1, 5) where the 5 is representing the number of digits counting from left.
  • When joining tables where the linking columns have the same names USING() is probably the best option for that.
  • When filtering has to take place after grouping of data, it can be done with HAVING(). WHERE is used before grouping to filter the data to be grouped.
  • UNION gets rid of duplicates in the joined tables and UNION ALL keeps the duplicates.