ISIA – Data Manipulation (1)

Information gathering

Example: Transcribing from physical structures

Stone rubbing — transcription of physical features

Example: Transcribing music by ear

  • Mozart (aged 14), transcription from memory
    • heard Allegri’s Miserere once
    • memorized it
    • transcribed it

Example: Taking lecture notes

Note taking — transcription in real-time

Example: Copy and paste

Manual bulk transcription of text

Example: Automatic scraping: optical character recognition

Automatic bulk transcription of selected texts

Example: Whole Internet scraping: the Internet archive

Wayback Machine — bulk transcription of over 327 billion web pages

Example: Search engine robots (aka ‘spiders’)

59 search engines crawling the web:
13TABS, 360Spider, AntBot, Apexoo, Applebot, ArielisBot, Baiduspider, Barkrowler, Cliqzbot, Daumoa, DeuSu, DuckDuckBot, Elefent, Exabot, FemtosearchBot, Gigabot, GoogleBot, IDBot, IstellaBot, KD, KOCMOHABT, Laserlikebot, Mail.Ru, MetaJobBot, MojeekBot, NaverBot, PDFDriveCrawler, Plukkie, Qwantify, SOLOFIELD, SauceNAO, SearcH, Seeker, SeznamBot, SnowHaze, Spider, Swobblspider, TarmotGezgin, TeeRaidBot, TinEye, Toweyabot, WBSearchBot, Wotbox, Yahoo!, YandexBot, YisouSpider, auskunftbot, bingbot, coccocbot, exif-search, freefind, glindahl-cocrawler, iqdb, omgilibot, parsijoo-bot, psbot, sogou, spider, vebidoobot, yacybot, yoozBot
https://udger.com/resources/ua-list/crawlers?c=1

Pages analysed for search terms, pictures, videos, …

Web scraping

  • Data scraping
    • a computer program extracts data from human-readable output coming from another program
  • Web scraping
    • when the output is from a web server
  • Web pages are intended for display by a browser, not for direct reading by a human
    • formatted as HyperText Markup Language (HTML)
    • overall structure is regular, hierarchical, and easily analysed
  • Page content is often plain text
    • has to be extracted from the surrounding HTML structure
    • unstructured and difficult (for non-humans) to understand
      • this difficulty is what distinguishes scraping from parsing
    • unreliable and error-prone
      • ad-hoc recognition of text is necessary
      • surrounding HTML structure can (and often does) change unpredictably
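A minimal sketch of data scraping in the general sense: one program extracting a field, by ad-hoc recognition, from the human-readable output of another program (here the Unix date command; its exact output format is an assumption).

import re
import subprocess

# run another program and capture its human-readable output
output = subprocess.run(["date"], capture_output=True, text=True).stdout
# ad-hoc recognition: pull out anything that looks like an HH:MM:SS time
m = re.search("([0-9][0-9]:[0-9][0-9]:[0-9][0-9])", output)
if m: print(m.group(1))
else: print("no time found")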

Typical HTML structure

<!DOCTYPE html>
<html lang="en" dir="ltr">
	<head>
		<meta charset="UTF-8" />
		<title>Frank Zappa - Wikipedia</title>
		<script> JavaScript source code here </script>
		<link rel="stylesheet" href="pagestyle.css" />
	</head>
	<body>
		<div class="style-info-here">
			Content meant for large document divisions (sections, sidebars, menus).
			Content meant to <span class="emphasis">appear inline</span>,
			such as a phrase spanning part of a sentence.
			Graphics <img src="picture-name" /> and
			<a href="http://some.other.site/index.html">web links</a>
			can be specified<!-- or commented out-->.
		</div>
	</body>
</html>
  • Structure is mostly hierarchical and ‘parenthesised’ (<tag> … </tag>), except that
    • some tags (<!DOCTYPE …>) are never closed
    • some opening tags (<meta, <link, <img) close themselves immediately (‘/>’)
    • some tags (<!-- -->) close themselves with non-standard syntax
      • and can contain arbitrarily-complex HTML within the tag itself
    • much of the hand-written HTML on the web is syntactically broken
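Tolerant parsers cope with these quirks. A minimal sketch using Python's built-in html.parser module, which reports declarations, self-closing tags, and comments as separate events:

from html.parser import HTMLParser

class ShowStructure(HTMLParser):
    # each syntactic construct arrives as a separate event
    def handle_starttag(self, tag, attrs): print("open:", tag, attrs)
    def handle_endtag(self, tag): print("close:", tag)
    def handle_startendtag(self, tag, attrs): print("self-closing:", tag)
    def handle_comment(self, data): print("comment:", data)
    def handle_decl(self, decl): print("declaration:", decl)

ShowStructure().feed('<!DOCTYPE html><div>hi <img src="x.png" /><!-- note --></div>')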

Practical: scraping Wikipedia pages

  • Requests for Wikipedia pages are easy to construct
    • http://en.wikipedia.org/wiki/name-of-page
  • For people, the page names are usually ‘FirstName_LastName’
    • http://en.wikipedia.org/wiki/Frank_Zappa
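A one-line sketch of constructing such a URL from a person's name (assuming the name needs no URL-encoding):

def wikipedia_url(name):
    # Wikipedia page names use underscores in place of spaces
    return "https://en.wikipedia.org/wiki/" + name.replace(" ", "_")

print(wikipedia_url("Frank Zappa"))  # https://en.wikipedia.org/wiki/Frank_Zappa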

It is easy to write a Python program to fetch the page content; first install the bs4 (Beautiful Soup) module for Python 3.

from bs4 import BeautifulSoup
import requests

http_request = requests.get('https://en.wikipedia.org/wiki/Special:Search?search=james+cameron')
soup = BeautifulSoup(http_request.text, "html.parser")
print(soup.prettify())

Content of pages easily extracted using regular expression matching

Python regular expression matching

import re
m = re.search("regularexpression", string)
if m: # regular expression found within string

Where the regular expression can contain

literal characters, which match themselves

re.search("abc", "hello") ⇒ None # aka False
re.search("ell", "hello") ⇒ a match object # aka True

character sets, which match any one of the members

re.search("e[xyz]l", "hello") ⇒ None
re.search("e[klm]l", "hello") ⇒ a match object

parentheses, for grouping, which also save the matched substring

re.search("e([xyz])l", "hello") ⇒ None
re.search("e([klm])l", "hello") ⇒ a match object
re.search("e([klm])l", "hello").group(1) ⇒ "l"
re.search("(e[klm]l)", "hello").group(1) ⇒ "ell"

e.g., to extract an ISO-formatted date (YYYY-MM-DD) from anywhere within a string

m = re.search("([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])", string)
if m: print(m.group(1))
else: print("no date found")
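The same pattern can be written more compactly using the \d digit class and repetition counts:

m = re.search(r"(\d{4}-\d{2}-\d{2})", string)
if m: print(m.group(1))
else: print("no date found")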

Scraping Wikipedia for birthdays

Pages about people usually include birthday, in a predictable format

  <span style="display:none">(<span class="bday">1940-12-21</span>)</span>December 21, 1940<br />
  • two steps to extract the desired data
  1. select only the line containing class="bday"
  2. try to find a date in the selected line, formatted YYYY-MM-DD

Using a Python regular expression to extract the data

import re
if 'class="bday"' in line:
    bday = re.search(">([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])<", line)
    if bday: print(bday.group(1))
    else: print(line)  # better than nothing

Specialized modules such as Beautiful Soup can target classes and HTML tags directly:

	print(soup.find_all(class_="bday")[0].get_text())
	print(soup.find(class_="bday").get_text())
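Putting the pieces together, a sketch of the whole task (assuming the page contains a span of class "bday", as above):

from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Frank_Zappa")
soup = BeautifulSoup(page.text, "html.parser")
bday = soup.find(class_="bday")   # first element with class "bday", or None
if bday: print(bday.get_text())   # 1940-12-21
else: print("no bday found")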

Web scraping summary

  • Fetch interesting web page(s) in HTML using HTTP
    • for small-scale, manually-driven download (wget, curl, or write a program)
    • for large data gathering, write a ‘bot’ program that ‘crawls’ many web sites
  • Extract interesting data
    • parse the HTML and/or plain text that has well-defined structure
    • search for keywords and patterns: names, telephone numbers, company information, URLs, etc.
    • reformat the data if necessary before saving
    • follow links found in the data (recursive web crawling; see the sketch after this list)
  • A few typical uses
    • web indexing (Google, Bing, etc.)
    • web and data mining
      • trending products, product reviews, housing rentals/sales, etc.
      • earthquake, weather, etc., monitoring
    • price monitoring and retail recommendation (kakaku.com, etc.)
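A minimal sketch of a recursive crawler built from the tools above (the starting URL is an example; a real bot must also respect robots.txt and rate limits):

from bs4 import BeautifulSoup
import requests

def crawl(url, seen, depth):
    # stop at the depth limit and never revisit a page
    if depth == 0 or url in seen: return
    seen.add(url)
    print(url)
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    for link in soup.find_all("a", href=True):    # follow links found in the data
        if link["href"].startswith("http"):       # absolute links only
            crawl(link["href"], seen, depth - 1)

crawl("https://en.wikipedia.org/wiki/Frank_Zappa", set(), 2)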

Newer forms of scraping: data feeds

  • Client-server
    • server produces data, client consumes data
  • Traditional method: receiver pull
    • client ‘polls’ server, asking for data
    • communication happens even when data has not changed
  • Newer model: sender push
    • client ‘subscribes’ to interesting data
    • server pushes data to client whenever available or changed
  • Newer formats make scraping easier
    • the interesting data is meant to be used by a ‘web app’
    • encoded in JavaScript Object Notation (JSON); client uses a published ‘web API’
    • e.g., Google services (maps, etc.)
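A sketch of consuming such a feed: the USGS publishes earthquake data as JSON (GeoJSON), so the client receives structured data and no HTML scraping is needed (the field names below follow the published feed format):

import requests

# fetch the past hour's earthquakes; .json() decodes the response into Python objects
feed = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson"
data = requests.get(feed).json()
for quake in data["features"]:
    print(quake["properties"]["mag"], quake["properties"]["place"])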