What is Web Scraping?

Today we are going to see how to do web scraping using Python.

Web scraping is the extraction or fetching of content/data from one or more websites. It is an automated, programmatic process by which data can be continuously scraped (extracted) from webpages.

With web scraping we can extract the underlying HTML (HyperText Markup Language) of a page, along with the data stored in it. It is also known as web harvesting, and it can provide instant data from any publicly accessible webpage.

Note: web scraping may be illegal, or against the terms of service, for some websites; check a site's policies before scraping it.
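One place a site states its crawling policy is its robots.txt file, which the standard-library urllib.robotparser can read. A minimal sketch (the rules below are made up for illustration; in practice you would fetch http://<site>/robots.txt):

```python
from urllib.robotparser import RobotFileParser

# made-up rules for illustration
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = RobotFileParser()
rp.parse(rules)

# check whether a given URL may be fetched by a given user agent
print(rp.can_fetch('*', 'http://www.example.com/index.html'))  # True
print(rp.can_fetch('*', 'http://www.example.com/private/x'))   # False
```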

Scraping with Curl

# imports
from io import StringIO
from subprocess import PIPE, Popen

from lxml import etree

# downloading
user_agent = ''  # user agent string (browser details)
url = ''         # URL to be scraped
# '-s': silent download, '-A': user agent flag
get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE)
result = get.stdout.read().decode('utf8')

# parsing
tree = etree.parse(StringIO(result), etree.HTMLParser())
divs = tree.xpath('//div')
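The xpath call returns a list of matching nodes (or plain strings, for attribute queries). As a quick standalone illustration, here is the same idea applied to a literal HTML string standing in for the downloaded page:

```python
from lxml import etree

# literal HTML, standing in for a downloaded page
html = '<html><body><div><a href="/a">A</a></div><div><a href="/b">B</a></div></body></html>'
tree = etree.fromstring(html, etree.HTMLParser())

# attribute queries return the attribute values as strings
hrefs = tree.xpath('//div/a/@href')
print(hrefs)  # ['/a', '/b']
```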

Web content download using urllib.request

urllib.request is a standard-library module that can be used to download web content.

from urllib.request import urlopen

res = urlopen('http://www.example.com/')
data = res.read()

# character set / encoding of the response (fall back to UTF-8 if none is declared)
encoding = res.info().get_content_charset() or 'utf-8'
html = data.decode(encoding)  # decode the received response
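Once the response is decoded, the standard library alone is enough for simple extraction. A minimal sketch using html.parser to collect link targets (the HTML string here is a stand-in for the downloaded page):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# stand-in for the decoded HTML downloaded above
html = '<html><body><a href="/about">About</a> <a href="/contact">Contact</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/about', '/contact']
```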

Scraping using the Scrapy Framework

Scrapy is a popular framework for web scraping.

First, we have to set up a new Scrapy project:

scrapy startproject scrapingTest

To scrape, we need to create a spider. Spiders define how a certain site (or group of sites) will be scraped.

import scrapy

class NewSpider(scrapy.Spider):
    name = ''  # unique name of the spider
    start_urls = ['']  # one or more URLs where crawling should start

    def parse(self, response):
        # find URLs using CSS selectors and HTML tags
        for href in response.css('.class h2 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            # for each request this yields, the response is sent to parse_question
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'body': response.css('.class .post-text').extract_first(),
            'tags': response.css('.class .post-tag::text').extract(),
            'links': response.url,
        }
               

Save your spider as scrapingTest/spiders/testspider.py.

Now use your spider for scraping.

To run it, execute the following in the project directory, where the argument must match the spider's name attribute:

scrapy crawl <spider-name>

Sometimes the default Scrapy user agent is blocked by the host. To change it, open settings.py and set USER_AGENT to the value you want.
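For example, in scrapingTest/settings.py (the value below is a placeholder; substitute whatever agent string you want to present):

```python
# scrapingTest/settings.py
# placeholder value; replace with the user agent you want to send
USER_AGENT = 'Mozilla/5.0 (compatible; scrapingTest/1.0)'
```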

Scraping using Selenium WebDriver

Some websites don’t like to be scraped. To scrape such websites, you may need to present yourself as a real user working with the site. Selenium launches and controls a real web browser.

Selenium can modify the browser’s cookies, fill forms, take screenshots, simulate mouse clicks, and run custom JavaScript.

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()  # launch a Firefox browser
browser.get('http://www.example.com/')  # load the URL

title = browser.find_element(By.CSS_SELECTOR, 'h1').text  # page title

results = browser.find_elements(By.CSS_SELECTOR, '.class')

for result in results:
    # search within each result element, not the whole page
    result_title = result.find_element(By.CSS_SELECTOR, 'h3 a').text
    result_excerpt = result.find_element(By.CSS_SELECTOR, '.excerpt').text
    print(result_title, result_excerpt)

More in upcoming posts.

Hope this helps!

Happy Learning :>
