Course Notes
Use this workspace to take notes, store code snippets, or build your own interactive cheatsheet! For courses that use data, the datasets will be available in the datasets folder.
# Import any packages you want to use here
import re
import urllib.request
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import Counter
import numpy as np
import seaborn as sns
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('punkt')
nltk.download('stopwords')
Introduction to Regular Expressions
Emails
You have a string that contains a list of email addresses separated by commas. Your task is to extract all the email addresses from the string using regular expressions.
Example string: "John Doe [email protected], Jane Smith [email protected], Bob Johnson [email protected]"
Expected output: ["[email protected]", "[email protected]", "[email protected]"]
string = "John Doe <[email protected]>, Jane Smith <[email protected]>, Bob Johnson <[email protected]>"
# The TLD class is [A-Za-z]: inside a character class, | is a literal pipe, not alternation
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', string)
print(emails)
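The addresses in the example string above appear only as placeholders, so the pattern's behaviour is hard to check from these notes alone. A minimal sketch, assuming a few invented addresses (the names and domains below are hypothetical, not taken from the exercise):

```python
import re

# Hypothetical sample data: these addresses are invented for illustration.
text = (
    "John Doe <john.doe@example.com>, "
    "Jane Smith <jane_smith@example.org>, "
    "Bob Johnson <bob@mail.example.net>"
)

# Local part, "@", one or more domain labels, then a TLD of two or more letters.
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

found_emails = re.findall(email_pattern, text)
print(found_emails)
# ['john.doe@example.com', 'jane_smith@example.org', 'bob@mail.example.net']
```

The angle brackets are not part of any character class in the pattern, so they are left out of each match.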
Phones
You have a string that contains a list of phone numbers separated by commas. Your task is to extract all the phone numbers from the string using regular expressions.
string = "John Doe: 555-1234, Jane Smith: 555-5678, Bob Johnson: 555-9012"
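# Exactly three digits, a hyphen, and four digits, bounded so longer digit runs are not matched in part
pattern = r'\b\d{3}-\d{4}\b'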
output = re.findall(pattern, string)
print(output)
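If some numbers also carried an area code, the same idea could be extended with an optional group. This is only a sketch under that assumption; the sample string below is hypothetical and the exercise itself uses seven-digit numbers only.

```python
import re

# Hypothetical sample data, including one number with an area code.
text = "John Doe: 555-1234, Jane Smith: 212-555-5678, Bob Johnson: 555-9012"

# Optional three-digit area code, then three digits, a hyphen, and four digits.
# The group is non-capturing so re.findall returns the whole match.
phone_pattern = r'\b(?:\d{3}-)?\d{3}-\d{4}\b'

phones = re.findall(phone_pattern, text)
print(phones)
# ['555-1234', '212-555-5678', '555-9012']
```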
Weekends
You have a string that contains a list of dates in the format "YYYY-MM-DD". Your task is to extract all the dates that fall on a weekend (Saturday or Sunday) using regular expressions.
Example string: "2023-05-01, 2023-05-02, 2023-05-03, 2023-05-04, 2023-05-05, 2023-05-06, 2023-05-07"
Expected output: ["2023-05-06", "2023-05-07"]
import datetime
string = "2023-05-01, 2023-05-02, 2023-05-03, 2023-05-04, 2023-05-05, 2023-05-06, 2023-05-07"
dates = []
for date_str in string.split(", "):
    year, month, day = map(int, date_str.split("-"))
    date = datetime.date(year, month, day)
    # weekday() counts from Monday = 0, so 5 and 6 are Saturday and Sunday
    if date.weekday() in [5, 6]:
        dates.append(date_str)
print(dates)
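The task asks for regular expressions, but a pattern alone cannot tell which weekday a date falls on. One possible combination, sketched here, is to let a regex find the date-shaped substrings and let `datetime` decide the weekday. The variable names are illustrative.

```python
import re
import datetime

text = "2023-05-01, 2023-05-02, 2023-05-03, 2023-05-04, 2023-05-05, 2023-05-06, 2023-05-07"

# Capture year, month, and day from anything shaped like YYYY-MM-DD.
date_pattern = r'\b(\d{4})-(\d{2})-(\d{2})\b'

weekend_dates = []
for match in re.finditer(date_pattern, text):
    year, month, day = map(int, match.groups())
    try:
        date = datetime.date(year, month, day)
    except ValueError:
        # Date-shaped but not a real calendar date (e.g. month 13): skip it.
        continue
    # weekday() counts from Monday = 0, so 5 and 6 are Saturday and Sunday.
    if date.weekday() >= 5:
        weekend_dates.append(match.group(0))

print(weekend_dates)
# ['2023-05-06', '2023-05-07']
```

Unlike splitting on `", "`, this version does not depend on the exact separator between dates.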
Vowels
You have a string that contains a list of words separated by commas. Your task is to extract all the words that start with a vowel using regular expressions.
Example string: "apple, banana, cherry, date, eggplant, fig, grapefruit"
Expected output: ["apple", "eggplant"]
string = "apple, banana, cherry, date, eggplant, fig, grapefruit"
pattern = r'\b[aeiouAEIOU][a-zA-Z]*\b'
print(re.findall(pattern, string))
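The leading `\b` is what restricts matches to word starts. A small comparison, using the same example string and `re.IGNORECASE` in place of spelling out both cases, may make that clearer:

```python
import re

text = "apple, banana, cherry, date, eggplant, fig, grapefruit"

# With word boundaries: a match can only begin at the start of a word.
anchored = re.findall(r'\b[aeiou][a-z]*\b', text, flags=re.IGNORECASE)

# Without them: a match begins at the first vowel found anywhere in a word.
unanchored = re.findall(r'[aeiou][a-z]*', text, flags=re.IGNORECASE)

print(anchored)
# ['apple', 'eggplant']
print(unanchored)
# ['apple', 'anana', 'erry', 'ate', 'eggplant', 'ig', 'apefruit']
```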
.com
You have a string that contains a list of email addresses in various formats. Your task is to extract all the email addresses that end with ".com" using regular expressions.
Example string: "[email protected], [email protected], [email protected], [email protected]"
Expected output: ["[email protected]", "[email protected]"]
string = "[email protected], [email protected], [email protected], [email protected]"
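# Match a literal ".com" (any letter case) rather than any three letters drawn from c, o, m
output = re.findall(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.com\b', string, flags=re.IGNORECASE)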
print(output)
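A character class such as `[comCOM]{3}` accepts any three of those letters in any order, so it is looser than a literal `.com`. The sketch below contrasts the two on invented addresses (all of them hypothetical, since the example addresses above appear only as placeholders):

```python
import re

# Hypothetical sample data.
text = "ann@example.com, bob@example.org, carol@mail.example.COM, dave@example.moc"

# Character class: any three letters from c, o, m; single-label domains only.
loose = re.findall(r'[a-zA-Z0-9._]+@[a-zA-Z]+\.[comCOM]{3}', text)

# Literal ".com", case-insensitive, with subdomains allowed.
strict = re.findall(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.com\b', text, flags=re.IGNORECASE)

print(loose)
# ['ann@example.com', 'dave@example.moc']
print(strict)
# ['ann@example.com', 'carol@mail.example.COM']
```

The stricter pattern still treats a longer domain such as `example.com.au` as containing a `.com` match; rejecting those would need an extra check on what follows.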
Introduction to tokenization
UpperLower
You have a string that contains a list of sentences. Your task is to extract all the words that start with a capital letter and end with a lowercase letter using regular expressions and word tokenization.
Example string: "The quick Brown Fox, Jumps over the Lazy Dog. The Cat in the Hat is a classic children's book."
Expected output: ["The", "Brown", "Fox", "Jumps", "Lazy", "Dog", "The", "Cat", "Hat"]
More info: Regular Expressions in Python
string = "The quick Brown Fox, Jumps over the Lazy Dog. The Cat in the Hat is a classic children's book."
sentences = sent_tokenize(string)
words = []
for sentence in sentences:
    tokens = word_tokenize(sentence)
    for token in tokens:
        if re.match(r'^[A-Z][a-z]+$',token):
            words.append(token)
print(words)
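For comparison, a regex-only sketch that skips the tokenizer and relies on word boundaries instead. It gives the same result on this example string, though it may differ from the tokenized version on other inputs (hyphenated words, for instance, where `\b` also falls at the hyphen).

```python
import re

text = "The quick Brown Fox, Jumps over the Lazy Dog. The Cat in the Hat is a classic children's book."

# One capital letter followed by one or more lowercase letters, as a whole word.
capitalized = re.findall(r'\b[A-Z][a-z]+\b', text)

print(capitalized)
# ['The', 'Brown', 'Fox', 'Jumps', 'Lazy', 'Dog', 'The', 'Cat', 'Hat']
```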
Hyphen ✅