
Course Notes: Web Scraping in Python


Use this workspace to take notes, store code snippets, or build your own interactive cheatsheet! For courses that use data, the datasets will be available in the datasets folder.

!pip install scrapy
# Import the packages used throughout these notes
import requests
import scrapy
from scrapy import Selector
from scrapy.crawler import CrawlerProcess

Take Notes

Add notes here about the concepts you've learned and code cells with code you want to keep.


Suppose you want to scrape the title, author, and published date of the latest 10 articles on the front page of the New York Times (https://www.nytimes.com/). The page has the following HTML structure:

# html_beta = '''
# <section class="css-1fanzo5">
#     <ul class="css-11sadjy">
#         <li class="css-8atqhb">
#             <a class="css-13s9j9u" href="https://www.nytimes.com/2023/04/28/world/asia/covid-india-vaccines-modi.html">
#                 <article class="css-1cmu9py" data-testid="article-wrapper" ...>
#                     <h2 class="css-1dq8tca e1xfvim30">
#                         Covid-19 Vaccines: A Rare Uptick in India as the Death Toll Mounts
#                     </h2>
#                     <p class="css-m7vo8i e1xfvim31">
#                         By <span class="css-z8k6dx e1xfvim32">Sameer Yasir</span>, <span class="css-1echdzn e1xfvim33">Karan Deep Singh</span> and <span class="css-1echdzn e1xfvim33">Sanya Dosani</span>
#                     </p>
#                     <time class="css-4n4n7h e1xfvim34" datetime="2023-04-28T10:02:38.000Z">April 28, 2023</time>
#                     ...
#                 </article>
#             </a>
#         </li>
#         <li class="css-8atqhb">
#             ...
#         </li>
#         ...
#     </ul>
# </section>

# '''
# # To use a saved copy of the page instead, read it from disk:
# # with open('index.html') as file:
# #     html_beta = file.read()
# sel = Selector(text=html_beta)
# titles = [x.strip() for x in sel.xpath('//article/h2/text()').extract()]
# # Loop over each <article> with a relative XPath so that each
# # story's authors stay grouped together
# authors = []
# for article in sel.xpath('//article'):
#     authors.append(article.xpath('.//p/span/text()').extract())
# dates = sel.xpath('//article/time/text()').extract()
# print(titles, authors, dates)

Wikipedia

Suppose we want to extract the following information from this HTML source code using Scrapy in Python:

  • The title of the Wikipedia page
  • The first paragraph of the page content
  • The list of features of the Python programming language
# # Set up the selector
# url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
# html = requests.get(url).text
# sel = Selector(text=html)
# # The title of the Wikipedia page
# title = sel.xpath('//h1//text()').extract_first()
# print(title)
# # The first paragraph of the page content
# # (hard-coded paragraph indices are fragile -- Wikipedia's markup changes)
# paragraphs = sel.xpath('//p[3]//text()').extract()
# p_clean_00 = [x for x in paragraphs if x != '\n']
# paragraph_00 = ''.join(p_clean_00)
# print(paragraph_00)
# # The paragraph describing Python's features
# paragraph_01 = sel.xpath('//p[4]//text()').extract()
# p_clean_01 = [x for x in paragraph_01 if x != '\n']
# paragraph_01 = ''.join(p_clean_01)
# print(paragraph_01)
# # Split the paragraph into a rough list of features (one per sentence)
# features_list = paragraph_01.split('.')
# print(features_list[:-1])

Amazon Best Sellers

Suppose we want to extract the following information from this HTML source code using Scrapy in Python:

  • The title of the Amazon Best Sellers page for Electronics
  • The names of the five best-selling electronic products on Amazon
  • The names of the five newest electronic products on Amazon
# # URLs of the web pages
# bst_url = 'https://www.amazon.com/Best-Sellers-Electronics/zgbs/electronics'
# nw_url = 'https://www.amazon.com/gp/new-releases/electronics'
# # Get the source code (note: Amazon often blocks requests that lack
# # browser-like headers, so a User-Agent header may be needed)
# bst_html = requests.get(bst_url).text
# nw_html = requests.get(nw_url).text  # don't overwrite nw_url with the response
# # Selector definition
# sel_bst = Selector(text=bst_html)
# sel_nw = Selector(text=nw_html)
# # Inspect the raw <div> blocks to find the product containers
# bst = sel_bst.xpath('//div').extract()
# print(bst[2])

Quotes

Create a Scrapy spider to scrape quotes from the website http://quotes.toscrape.com/. The spider should extract the following information from each quote:

  • Quote text
  • Author name
  • Tags associated with the quote