Notes On SQL
On casting in SQL.
- What's the difference with casting an 'integer' vs 'numeric'?
- CAST() is a way to modify a data type within the executed query. It will not cascade the commands down to the database. Shorthand in SQL for CAST() is ::. I can see this being handy when building queries where I need to make sure data types match.
- What's the calculation that runs behind the scenes for stddev() function?
On coalesce.
- With COALESCE(col1, col2) I can fill missing values in col1 with col2's values.
Reminder for myself on subqueries, whether they're used in FROM or SELECT, they can only return one column. That means, I can compute information based on many columns, but the result can only be a single column.
New functions and revisting an old one.
- trunc() will let me cut down the number of digits in the value, similar to round(), but it doesn't move the value up or down based on rounding rules.
- generate_series() is similar to range() in python. I can use it to generate a range of ordered numbers in a column. It has the same arguments as well, from - to - step.
- WITH [table_name] AS (SELECT * FROM table) did cause a bit of a struggle. WITH AS is a way to build a table that can be used to summarise data, without printing the results. This summary data can then be used in the core part of the query. How do I use this? (More complete example needed.)
- corr() is a function that can be used to determine the relationship between two variables. Correlation is indicated by 0 to 1, where close to 1 indicates a stronger correlation and closer to 0 a weaker correlation between the two variables. It takes in two argumenst corr(arg1, arg2) where the arguments are the columns to be compared.
- Median and Mean in SQL is a value that needs to be computed = percentile_disc() or percentile_cont() WITHIN GROUP (ORDER BY col) FROM table. There needs to be an operation to order the results. Important to remember that the percentile_disc() and precentile_cont() requires 0.5 as the argument to compute the mean or median. The number provided indicates to the algorithm what portion of the values to include in the calculation.
- Temporary tables. Removing existing DROP TABLE IF EXIST x and CREATE TEMP TABLE [xyz] AS /n SELECT() FROM
On working with text data.
- Text types are varchar and text.
- Filtering text data with LIKE and ILIKE. Latter ignores the case of the string supplied inside '%x%'. Former matches the case. % is 0 or more characters, including spaces.
- trim() can be used to remove characters or spaces from text data. prefixing trim() with r or l will remove characters from the beginning or end of the string. trim(col, '0123abcd') takes the column or value(s) as the first argument and the characters to be removed as a second argument.
- lower() and upper() will set the case on the variable or column.
- split_part(col/value, arg1, arg2) takes three arguments. It splits the provided col or value on the character arg1, using it as the delimiter, arg2 tells which part of the split string to return?
- Tables can be updated with UPDATE [tablename] SET [col]=[value] WHERE [col]=[value]
On working with dates.
- I've been using three different functions so far.
- date_part() takes in two arguments, interval of the date an date column/value/field. In that order. date_part('day', date_col). It returns only that part of the date, scrapping the rest of the data. This is good for memory related reason?
- date_trun() behaves the same way, except, it returns the requested interval and retains the information of the date outside of the interval.
- EXTRACT(DOW FROM date_col), first argument tells the interval to extract from the date_col.
Mid assesment as a practice run: Q1: Return the rows that appears both in the movie_2000 and movie_2010 table. A: Intersect will return the rows from two tables with matches. Q2: Return the records if they have duplicates from the vendors table. A: Duplicate counting in SQL is done by using GROUP BY and COUNT() functions. Q3: From the vendors table, return the top 5 states with the largest number of vendors. A: Should have known this. Think more about the result vs what's already there. Here I only had to summarise data. Q4: From the fruit_2022 table, extract the month from the date column and convert the month to the NUMERIC data type. A: I used date_trunc() to extract the month, but it doesn't seem to work when casting to NUMERIC. In these situations I should use EXTRACT(MONTH FROM date_col)::NUMERIC Q5: From the movie_budget table, return the year with more than 3 movies in the table. A: Same as Q3, repeated mistake here. Q6: For each vendor_name in the vendors table, ensure that only the first letter of each word is upper case e.g. DATACAMP would become Datacamp. A: INITCAP() is a new function for me. Seems to do the same as .title() in Python. Q7: Add the rows from the movie_2010 table to movie_2000 but keep the duplicates. A: UNION ALL. What's the difference between UNION ALL and UNION? Q8: Return the number of duplicates as long as they are the same in the name and city column. A: Same as Q2. Reminder also to think about which columns need to be in the GROUP BY clause for the query not to result in an error. Q9: Add the rows from the movie_2010 table to movie_2000 but remove the duplicates. A: UNION. Here's my answer as well. UNION will not create duplicates, whereas UNION ALL will do so.
Summary of the mistakes.
- Duplicate functions are 'manual' in SQL. Need to group the rows and count occurrences per category.
- One new function to capitalise first letter of a word.
- Data casting has some limitations with date_trunc vs. EXTRACT.
- Merging tables, need to been in mind what INTERSECT vs. UNION vs. UNION ALL will return.
Skills assessment 'Data Management in SQL (PostgreSQL)' notes:
- For combining string data between two columns in SQL use CONCAT(col1,col2,' add anyother content')
- In order to get specifics out of a date use EXTRACT(month FROM date::DATE). That example takes the month from a date as the month number.
- SQL method matching Python's .title() is INITCAP() maybe it will help memorizing it as 'initial capital'.
- Once again EXTRACT() to the rescue. Getting a day as a number EXTRACT(DAY FROM Start_Time) :: Numeric
- When doing self joins, there's no need to alias the tables with AS. I can just create additional tables by 'tablename a' and 'tablename b'.
- I need to work on INNER, OUTER, LEFT, RIGHT joins and their use cases.
- To extract digits from a string I can use LEFT(col1, 5) where the 5 is representing the number of digits counting from left.
- When joining tables where the linking columns have the same names USING() is probably the best option for that.
- When filtering has to take place after grouping of data, it can be done with HAVING(). WHERE is used before grouping to filter the data to be grouped.
- UNION gets rid of duplicates in the joined tables and UNION ALL keeps the duplicates.