Course Notes: Exploratory Data Analysis in SQL

Beta

Notes On SQL

On casting in SQL.

What's the difference with casting an 'integer' vs 'numeric'?

CAST() is a way to modify a data type within the executed query. It will not cascade the commands down to the database. Shorthand in SQL for CAST() is ::. I can see this being handy when building queries where I need to make sure data types match.

What's the calculation that runs behind the scenes for stddev() function?

On coalesce.

With COALESCE(col1, col2) I can fill missing values in col1 with col2's values.

Reminder for myself on subqueries, whether they're used in FROM or SELECT, they can only return one column. That means, I can compute information based on many columns, but the result can only be a single column.

New functions and revisting an old one.

trunc() will let me cut down the number of digits in the value, similar to round(), but it doesn't move the value up or down based on rounding rules.
generate_series() is similar to range() in python. I can use it to generate a range of ordered numbers in a column. It has the same arguments as well, from - to - step.
WITH [table_name] AS (SELECT * FROM table) did cause a bit of a struggle. WITH AS is a way to build a table that can be used to summarise data, without printing the results. This summary data can then be used in the core part of the query. How do I use this? (More complete example needed.)
corr() is a function that can be used to determine the relationship between two variables. Correlation is indicated by 0 to 1, where close to 1 indicates a stronger correlation and closer to 0 a weaker correlation between the two variables. It takes in two argumenst corr(arg1, arg2) where the arguments are the columns to be compared.
Median and Mean in SQL is a value that needs to be computed = percentile_disc() or percentile_cont() WITHIN GROUP (ORDER BY col) FROM table. There needs to be an operation to order the results. Important to remember that the percentile_disc() and precentile_cont() requires 0.5 as the argument to compute the mean or median. The number provided indicates to the algorithm what portion of the values to include in the calculation.
Temporary tables. Removing existing DROP TABLE IF EXIST x and CREATE TEMP TABLE [xyz] AS /n SELECT() FROM

On working with text data.

Text types are varchar and text.
Filtering text data with LIKE and ILIKE. Latter ignores the case of the string supplied inside '%x%'. Former matches the case. % is 0 or more characters, including spaces.
trim() can be used to remove characters or spaces from text data. prefixing trim() with r or l will remove characters from the beginning or end of the string. trim(col, '0123abcd') takes the column or value(s) as the first argument and the characters to be removed as a second argument.
lower() and upper() will set the case on the variable or column.
split_part(col/value, arg1, arg2) takes three arguments. It splits the provided col or value on the character arg1, using it as the delimiter, arg2 tells which part of the split string to return?
Tables can be updated with UPDATE [tablename] SET [col]=[value] WHERE [col]=[value]

On working with dates.

I've been using three different functions so far.
date_part() takes in two arguments, interval of the date an date column/value/field. In that order. date_part('day', date_col). It returns only that part of the date, scrapping the rest of the data. This is good for memory related reason?
date_trun() behaves the same way, except, it returns the requested interval and retains the information of the date outside of the interval.
EXTRACT(DOW FROM date_col), first argument tells the interval to extract from the date_col.

Mid assesment as a practice run: Q1: Return the rows that appears both in the movie_2000 and movie_2010 table. A: Intersect will return the rows from two tables with matches. Q2: Return the records if they have duplicates from the vendors table. A: Duplicate counting in SQL is done by using GROUP BY and COUNT() functions. Q3: From the vendors table, return the top 5 states with the largest number of vendors. A: Should have known this. Think more about the result vs what's already there. Here I only had to summarise data. Q4: From the fruit_2022 table, extract the month from the date column and convert the month to the NUMERIC data type. A: I used date_trunc() to extract the month, but it doesn't seem to work when casting to NUMERIC. In these situations I should use EXTRACT(MONTH FROM date_col)::NUMERIC Q5: From the movie_budget table, return the year with more than 3 movies in the table. A: Same as Q3, repeated mistake here. Q6: For each vendor_name in the vendors table, ensure that only the first letter of each word is upper case e.g. DATACAMP would become Datacamp. A: INITCAP() is a new function for me. Seems to do the same as .title() in Python. Q7: Add the rows from the movie_2010 table to movie_2000 but keep the duplicates. A: UNION ALL. What's the difference between UNION ALL and UNION? Q8: Return the number of duplicates as long as they are the same in the name and city column. A: Same as Q2. Reminder also to think about which columns need to be in the GROUP BY clause for the query not to result in an error. Q9: Add the rows from the movie_2010 table to movie_2000 but remove the duplicates. A: UNION. Here's my answer as well. UNION will not create duplicates, whereas UNION ALL will do so.

Summary of the mistakes.

Duplicate functions are 'manual' in SQL. Need to group the rows and count occurrences per category.
One new function to capitalise first letter of a word.
Data casting has some limitations with date_trunc vs. EXTRACT.
Merging tables, need to been in mind what INTERSECT vs. UNION vs. UNION ALL will return.

Skills assessment 'Data Management in SQL (PostgreSQL)' notes:

For combining string data between two columns in SQL use CONCAT(col1,col2,' add anyother content')
In order to get specifics out of a date use EXTRACT(month FROM date::DATE). That example takes the month from a date as the month number.
SQL method matching Python's .title() is INITCAP() maybe it will help memorizing it as 'initial capital'.
Once again EXTRACT() to the rescue. Getting a day as a number EXTRACT(DAY FROM Start_Time) :: Numeric
When doing self joins, there's no need to alias the tables with AS. I can just create additional tables by 'tablename a' and 'tablename b'.
I need to work on INNER, OUTER, LEFT, RIGHT joins and their use cases.
To extract digits from a string I can use LEFT(col1, 5) where the 5 is representing the number of digits counting from left.
When joining tables where the linking columns have the same names USING() is probably the best option for that.
When filtering has to take place after grouping of data, it can be done with HAVING(). WHERE is used before grouping to filter the data to be grouped.
UNION gets rid of duplicates in the joined tables and UNION ALL keeps the duplicates.

Course Notes: Exploratory Data Analysis in SQL

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Notes On SQL

Notes On SQL