Víctor Borbor/

Course Notes: Introduction to Natural Language Processing in Python

Course Notes

Use this workspace to take notes, store code snippets, or build your own interactive cheatsheet! For courses that use data, the datasets will be available in the datasets folder.

# Import any packages you want to use here
import re
import urllib.request
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import Counter
import numpy as np
import seaborn as sns

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('punkt')
nltk.download('stopwords')

Introduction to Regular Expressions

Emails

You have a string that contains a list of email addresses separated by commas. Your task is to extract all the email addresses from the string using regular expressions.

Example string: "John Doe [email protected], Jane Smith [email protected], Bob Johnson [email protected]"

Expected output: ["[email protected]", "[email protected]", "[email protected]"]

string = "John Doe <[email protected]>, Jane Smith <[email protected]>, Bob Johnson <[email protected]>"
# Inside a character class "|" is a literal pipe, so the TLD class is [A-Za-z], not [A-Z|a-z]
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', string)
print(emails)
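
As a quick check of how the pattern behaves, here is a minimal sketch run on made-up addresses. The `example.com`, `example.org`, and `example.net` addresses and the names `sample`, `email_pattern`, and `found_emails` are assumptions for illustration, not part of the course data.

```python
import re

# Hypothetical addresses for illustration only; they are not from the course data.
sample = ("John Doe <john.doe@example.com>, "
          "Jane Smith <jane_smith@example.org>, "
          "Bob Johnson <bob@mail.example.net>")

# Local part, "@", domain labels, then a top-level domain of at least two letters
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'

found_emails = re.findall(email_pattern, sample)
print(found_emails)
# ['john.doe@example.com', 'jane_smith@example.org', 'bob@mail.example.net']
```

The angle brackets are not in either character class, so they act as natural delimiters and are left out of each match.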

Phones

You have a string that contains a list of phone numbers separated by commas. Your task is to extract all the phone numbers from the string using regular expressions.

string = "John Doe: 555-1234, Jane Smith: 555-5678, Bob Johnson: 555-9012"
# Exactly three digits, a hyphen, then exactly four digits
pattern = r'\b\d{3}-\d{4}\b'
output = re.findall(pattern, string)
print(output)
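
The choice of quantifier matters here: `{3,}` means "three or more", while `{3}` means "exactly three". The sketch below compares the two on a hypothetical string whose last entry is malformed. The input and the names `loose` and `strict` are assumptions for illustration.

```python
import re

# Hypothetical input: the last entry has too many digits to be a 7-digit number.
sample = "John Doe: 555-1234, Jane Smith: 555-5678, Bad entry: 5555-123456"

# Open-ended quantifiers accept any longer run of digits
loose = re.findall(r'\d{3,}-\d{4,}', sample)

# Exact counts plus word boundaries reject the malformed entry
strict = re.findall(r'\b\d{3}-\d{4}\b', sample)

print(loose)   # ['555-1234', '555-5678', '5555-123456']
print(strict)  # ['555-1234', '555-5678']
```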

Weekends

You have a string that contains a list of dates in the format "YYYY-MM-DD". Your task is to extract all the dates that fall on a weekend (Saturday or Sunday) using regular expressions.

Example string: "2023-05-01, 2023-05-02, 2023-05-03, 2023-05-04, 2023-05-05, 2023-05-06, 2023-05-07"

Expected output: ["2023-05-06", "2023-05-07"]

import datetime

string = "2023-05-01, 2023-05-02, 2023-05-03, 2023-05-04, 2023-05-05, 2023-05-06, 2023-05-07"

dates = []

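# Extract every YYYY-MM-DD token with a regular expression
for date_str in re.findall(r'\b\d{4}-\d{2}-\d{2}\b', string):
    year, month, day = map(int, date_str.split("-"))
    date = datetime.date(year, month, day)
    # weekday() returns 0 for Monday through 6 for Sunday
    if date.weekday() in (5, 6):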
        dates.append(date_str)

print(dates)
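
A regular expression can find the dates but cannot tell which day of the week they fall on, so the calendar check still needs `datetime`. One possible variant, sketched below, uses capture groups so the year, month, and day come straight out of the match. The names `sample` and `weekend` are assumptions for illustration.

```python
import re
from datetime import date

sample = "2023-05-01, 2023-05-02, 2023-05-03, 2023-05-04, 2023-05-05, 2023-05-06, 2023-05-07"

weekend = []
# With three groups, findall returns (year, month, day) tuples of strings
for y, m, d in re.findall(r'\b(\d{4})-(\d{2})-(\d{2})\b', sample):
    day = date(int(y), int(m), int(d))
    # weekday(): Monday is 0, Saturday is 5, Sunday is 6
    if day.weekday() >= 5:
        weekend.append(day.isoformat())

print(weekend)  # ['2023-05-06', '2023-05-07']
```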

Vowels

You have a string that contains a list of words separated by commas. Your task is to extract all the words that start with a vowel using regular expressions.

Example string: "apple, banana, cherry, date, eggplant, fig, grapefruit"

Expected output: ["apple", "eggplant"]

string = "apple, banana, cherry, date, eggplant, fig, grapefruit"
pattern = r'\b[aeiouAEIOU][a-zA-Z]*\b'
print(re.findall(pattern, string))
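
The leading `\b` is what keeps the pattern from matching vowels in the middle of a word, such as the "a" in "date". The sketch below shows the same idea with `re.IGNORECASE` instead of listing both cases in the character class. The mixed-case input and the name `vowel_words` are assumptions for illustration.

```python
import re

# Hypothetical mixed-case input for illustration
sample = "apple, Banana, cherry, Date, Eggplant, fig, grapefruit, Orange"

# IGNORECASE lets [aeiou] and [a-z] cover uppercase letters as well
vowel_words = re.findall(r'\b[aeiou][a-z]*\b', sample, flags=re.IGNORECASE)

print(vowel_words)  # ['apple', 'Eggplant', 'Orange']
```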

.com

You have a string that contains a list of email addresses in various formats. Your task is to extract all the email addresses that end with ".com" using regular expressions.

Example string: "[email protected], [email protected], [email protected], [email protected]"

Expected output: ["[email protected]", "[email protected]"]

string = "[email protected], [email protected], [email protected], [email protected]"
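# Match the literal suffix ".com" case-insensitively; [comCOM]{3} would also accept strings like "moc"
output = re.findall(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.com\b', string, flags=re.IGNORECASE)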
print(output)
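
A character class such as `[comCOM]{3}` matches any three of those letters in any order, which is not the same as the literal suffix `.com`. The sketch below illustrates the difference on made-up addresses. The `example.*` addresses and the names `com_pattern`, `com_emails`, and `class_matches` are assumptions for illustration.

```python
import re

# Hypothetical addresses for illustration only
sample = "alice@example.com, bob@example.org, carol@mail.example.COM, dave@example.net"

# A literal ".com" followed by a word boundary, matched case-insensitively
com_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.com\b'
com_emails = re.findall(com_pattern, sample, flags=re.IGNORECASE)
print(com_emails)  # ['alice@example.com', 'carol@mail.example.COM']

# Pitfall: the character class accepts any ordering or repetition of its letters
class_matches = re.findall(r'\.[comCOM]{3}', ".com .moc .ccc")
print(class_matches)  # ['.com', '.moc', '.ccc']
```

Note that `\.com\b` would still match the first part of an address like `user@example.com.uk`; a stricter check would need a lookahead, which is left out of this sketch.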

Introduction to tokenization

UpperLower

You have a string that contains a list of sentences. Your task is to extract all the words that start with a capital letter and end with a lowercase letter using regular expressions and word tokenization.

Example string: "The quick Brown Fox, Jumps over the Lazy Dog. The Cat in the Hat is a classic children's book."

Expected output: ["The", "Brown", "Fox", "Jumps", "Lazy", "Dog", "The", "Cat", "Hat"]

More info: Regular Expressions in Python

string = "The quick Brown Fox, Jumps over the Lazy Dog. The Cat in the Hat is a classic children's book."
sentences = sent_tokenize(string)

words = []
for sentence in sentences:
    tokens = word_tokenize(sentence)
    for token in tokens:
        if re.fullmatch(r'[A-Z][a-z]+', token):
            words.append(token)

print(words)
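
For comparison, the same filter can be sketched without NLTK by using a rough regex as a stand-in for word tokenization. This is only an approximation and not how `word_tokenize` works: for instance, it keeps "children's" as one token. The names `rough_tokens` and `capitalized` are assumptions for illustration.

```python
import re

sample = ("The quick Brown Fox, Jumps over the Lazy Dog. "
          "The Cat in the Hat is a classic children's book.")

# Rough stand-in for a word tokenizer: runs of letters and apostrophes
rough_tokens = re.findall(r"[A-Za-z']+", sample)

# Keep tokens made of one capital letter followed only by lowercase letters
capitalized = [t for t in rough_tokens if re.fullmatch(r'[A-Z][a-z]+', t)]

print(capitalized)
# ['The', 'Brown', 'Fox', 'Jumps', 'Lazy', 'Dog', 'The', 'Cat', 'Hat']
```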

Hyphen ✅