ISIA – Data Manipulation (2)

Uniform resource identifiers

  • A request for a resource or action optionally follows the address
  • scheme:[//[user[:password]@]host[:port]][/path][?query][#fragment]
  • Path names some hierarchical resource or action desired of the server
    • must begin with a slash ‘/’ (but cannot begin with two slashes ‘//’)
    • looks like a POSIX (BSD, MacOS, Linux) path (file) name
      • additional ‘/’ characters are used to specify a resource hierarchically, e.g.:
        http://some.machine.net/blog/uploads/2018/05/02/isa04.pdf
  • Query permits adding key-value parameters to the request separated by &
    • path?key1=value1&key2=value2&…
    • e.g.: http://www.google.com/search?q=tcp%2Fip (%2F is the percent-encoded form of ‘/’)
  • Fragment permits specifying a section or index within the resource
    • web browsers use the fragment to scroll the page to the desired location
    • note: in that case the fragment is interpreted by the client, not the server
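
The components described above can be pulled apart with Python's standard urllib.parse module. This is only a small sketch: the URI is the search example from this page, and the "#results" fragment has been appended purely for illustration (it is not part of the original example).

```python
from urllib.parse import urlsplit, parse_qs

# Example URI from the slide; the '#results' fragment is an added assumption.
uri = "http://www.google.com/search?q=tcp%2Fip#results"
parts = urlsplit(uri)

print(parts.scheme)    # http
print(parts.hostname)  # www.google.com
print(parts.path)      # /search
print(parts.query)     # q=tcp%2Fip
print(parts.fragment)  # results

# parse_qs splits the query on '&' and '=' and decodes percent-escapes,
# so %2F becomes '/'. Each key maps to a list of values.
params = parse_qs(parts.query)
print(params)          # {'q': ['tcp/ip']}
```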

More regular expression features in Python

  • Repetition, zero or more times: x*
    • yes* matches ye, yes, yessssssssssssssssssss, etc.
  • Repetition, one or more time(s): x+
    • no+ matches no, noooooooooo, etc.
  • Repetition, zero or one time(s): x?
    • too? matches to and too only
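
The three repetition operators can be checked quickly at the Python prompt. In this sketch the helper name whole_match is made up for illustration; it uses re.fullmatch so that the pattern has to cover the entire string. Note that each operator applies only to the single character before it.

```python
import re

def whole_match(pattern, s):
    """Return True if the pattern matches the entire string s (hypothetical helper)."""
    return re.fullmatch(pattern, s) is not None

# x* : zero or more of the preceding character (here the 's')
print(whole_match("yes*", "ye"))     # True
print(whole_match("yes*", "yesss"))  # True
print(whole_match("yes*", "y"))      # False, the 'e' is not optional

# x+ : one or more of the preceding character (here the 'o')
print(whole_match("no+", "n"))       # False
print(whole_match("no+", "nooo"))    # True

# x? : zero or one of the preceding character (here the second 'o')
print(whole_match("too?", "to"))     # True
print(whole_match("too?", "too"))    # True
print(whole_match("too?", "tooo"))   # False
```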

Grouping and repetition work together

import re

# Group 1 is an optional sign; group 2 is one or more digits.
for string in ['0', '11', '-100', '+123', '-+-9aa', '+', '-', 'signed -42 number']:
    m = re.search("([-+])?([0-9]+)", string)
    if m:  # regular expression found within string
        print(string, "=>", "sign", m.group(1), "magnitude", m.group(2))
    else:  # no digits anywhere in the string
        print(string, "=>", m)

Output:
0 => sign None magnitude 0
11 => sign None magnitude 11
-100 => sign - magnitude 100
+123 => sign + magnitude 123
-+-9aa => sign - magnitude 9
+ => None
- => None
signed -42 number => sign - magnitude 42

More regular expression features in Python

  • In a regular expression
    • ^ matches the beginning of the string
    • $ matches the end of the string
  • re.match() finds a match only at the start of the string
    • re.match("regex", s) is equivalent to re.search("^regex", s)
  • To match a regex only at the end of the string
    • re.search("regex$", string)
  • To ensure the entire string matches
    • re.search("^regex$", string)
  • ^ and $ are called anchors
    • they don’t match characters; they match positions, namely the very start or the very end of the string
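
A short sketch of how the anchors change what is found, reusing the test string 'signed -42 number' from the earlier example. The variable names are arbitrary.

```python
import re

s = "signed -42 number"

anywhere = re.search("[0-9]+", s)    # unanchored: finds '42' in the middle
at_start = re.match("[0-9]+", s)     # only tries the start of the string
caret    = re.search("^[0-9]+", s)   # same effect as re.match here
at_end   = re.search("number$", s)   # must finish exactly at the end

print(anywhere.group(0))  # 42
print(at_start)           # None
print(caret)              # None
print(at_end.group(0))    # number

# Both anchors together: the whole string must consist of digits.
print(re.search("^[0-9]+$", "42") is not None)   # True
print(re.search("^[0-9]+$", "42 ") is not None)  # False, trailing space
```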

Weather information format

Aviation digital data service

  • Aggregates METAR weather reports from the whole world
  • METAR — meteorological aerodrome report
    • the most common format for the transmission of observational weather data
    • standardized by the International Civil Aviation Organization (ICAO)
    • METARs from anywhere can be understood by anyone
  • E.g., report on previous page:
    SLLP 061600Z 08008KT 9999 ////// 13/M11 Q1041
    station name, date&time (UTC), wind direction and speed, visibility, clouds, temperature and dew point, barometric pressure
  • Current world-wide reports available by HTTP
    • http://www.aviationweather.gov/adds/dataserver_current

Data request parameters

  • http://www.aviationweather.gov/adds/dataserver_current/httpparam?parameters
    • parameters follow the usual URI conventions
    • in particular, multiple parameters are separated with ‘&’ characters
  • E.g., for Osaka Itami (station code RJOO):
    http://www.aviationweather.gov/adds/dataserver_current/httpparam?
      dataSource=metars                        (the kind of information requested)
    & requestType=retrieve                     (get data, rather than server status, etc.)
    & format=csv                               (comma-separated, not XML, which is hard to parse)
    & stationString=RJOO                       (the station we want data from)
    & hoursBeforeNow=36                        (how far back to search; 36 hours is about the limit)
    & mostRecentForEachStation=constraint      (one result per station)
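
Rather than typing the query string by hand, it can be assembled from a dictionary with urllib.parse.urlencode, which inserts the ‘&’ and ‘=’ separators. A minimal sketch: the base address and parameter names are the ones listed above, while the function name metar_url and its hours argument are hypothetical.

```python
from urllib.parse import urlencode

BASE = "http://www.aviationweather.gov/adds/dataserver_current/httpparam"

def metar_url(station, hours=36):
    """Build the request URI for one station (hypothetical helper)."""
    params = {
        "dataSource": "metars",                    # the kind of information requested
        "requestType": "retrieve",                 # get data, not server status
        "format": "csv",                           # comma-separated response
        "stationString": station,                  # e.g. "RJOO"
        "hoursBeforeNow": hours,                   # how far back to search
        "mostRecentForEachStation": "constraint",  # one result per station
    }
    # urlencode joins key=value pairs with '&' and escapes special characters.
    return BASE + "?" + urlencode(params)

print(metar_url("RJOO"))
```

The printed address should be the same as the single-line URI used in the practical.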

Data response format

No errors
No warnings
3 ms
data source=metars
1 results
raw_text,station_id, …more column headings …
RJOO 151100Z 23005KT 190V260 CAVOK 22/11 Q1013 RMK A2992,RJOO, …

  • The first six lines contain meta information about the query and the response format; data begins on line 7
    • everything up to the first comma is the METAR data
  • To scrape the response
    • remove the first six lines
    • for each remaining line
      • remove the first comma and everything following it
      • decode the (space-separated) fields
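
The scraping steps above can be sketched as follows. The SAMPLE text is an abbreviated copy of the response shown on this page, keeping only the two columns that are visible there (the real response has more columns). The function name metar_fields is made up for illustration.

```python
# Abbreviated sample response: only the columns shown in the slide are kept.
SAMPLE = """No errors
No warnings
3 ms
data source=metars
1 results
raw_text,station_id
RJOO 151100Z 23005KT 190V260 CAVOK 22/11 Q1013 RMK A2992,RJOO
"""

def metar_fields(response_text):
    """Return one list of space-separated METAR fields per data line."""
    reports = []
    for line in response_text.splitlines()[6:]:  # remove the first six lines
        if not line.strip():                     # ignore blank lines
            continue
        raw = line.split(",", 1)[0]              # keep text before the first comma
        reports.append(raw.split())              # split on spaces into fields
    return reports

for fields in metar_fields(SAMPLE):
    print(fields)
```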

Scraping the response

  • The first field is always the station name: RJOO
  • The second field is always the time of the observation: 151100Z
    • day of month (two digits)
    • time of day (four digits)
    • the letter ‘Z’ indicating the timezone (always Zulu, aka GMT, aka UTC+0)
  • Some of the remaining fields might be missing
    • each field designed to be unambiguous
    • identify the field type using a regular expression
    • once the type is identified, extract the content from known positions
  • E.g., wind direction and speed: 23005KT
    • in its basic form, always five digits followed by KT (variable wind is written VRB followed by two digits and KT)
    • regular expression to match this field: [0-9][0-9][0-9][0-9][0-9]KT
    • direction is field[0:3]
    • speed is field[3:5]
  • Temperature / dew point (in degrees Celsius): 22/11
    • the temperatures are always two digits
    • the letter M is used instead of a negative sign
    • a temperature of 11°C and dew point −5°C would be encoded 11/M05
  • Barometric pressure: Q1013
    • regular expression: Q[0-9][0-9][0-9][0-9]
    • pressure is field[1:5]
  • Cloud information is a little harder to decode
    • SKC means the sky is clear; CAVOK means ceiling and visibility OK, which also implies no significant cloud
    • CLR means no clouds below 12,000 feet
    • OVC means overcast
    • FEW, SCT, BKN mean few, scattered or broken clouds
      • followed by three digits giving their altitude in 100s of feet
        BKN013 ⇒ broken clouds with bases at 1,300 feet
    • the information can be repeated for multiple cloud layers
      • FEW008 SCT010 BKN100
    • if any part of the information is missing, the three characters are replaced with ///
      BKN/// ⇒ broken clouds with unknown base altitude
      ////// ⇒ cloud information unknown
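
One possible way to put the identify-then-extract idea into code is sketched below. It handles only the field types described above (time, basic wind, temperature, pressure, clouds) and labels everything else "unknown". The function name decode_field and the shape of the returned tuples are assumptions made for this example, not part of the course material.

```python
import re

def decode_field(field):
    """Identify one METAR field with a regex, then extract by position (sketch)."""
    if re.search("^[0-9][0-9][0-9][0-9][0-9][0-9]Z$", field):
        # day of month, then time of day as hh:mm
        return ("time", int(field[0:2]), field[2:4] + ":" + field[4:6])
    if re.search("^[0-9][0-9][0-9][0-9][0-9]KT$", field):
        # direction is field[0:3], speed is field[3:5]
        return ("wind", int(field[0:3]), int(field[3:5]))
    m = re.search("^(M?)([0-9][0-9])/(M?)([0-9][0-9])$", field)
    if m:
        # an 'M' prefix stands for a negative sign
        temp = -int(m.group(2)) if m.group(1) else int(m.group(2))
        dew = -int(m.group(4)) if m.group(3) else int(m.group(4))
        return ("temperature", temp, dew)
    if re.search("^Q[0-9][0-9][0-9][0-9]$", field):
        return ("pressure", int(field[1:5]))
    m = re.search("^(FEW|SCT|BKN)([0-9][0-9][0-9]|///)$", field)
    if m:
        # altitude is given in hundreds of feet; '///' means unknown
        base = None if m.group(2) == "///" else int(m.group(2)) * 100
        return ("cloud", m.group(1), base)
    if field in ("SKC", "CAVOK", "CLR", "OVC"):
        return ("cloud", field, None)
    if field == "//////":
        return ("cloud", None, None)  # cloud information unknown
    return ("unknown", field)

for f in "SLLP 061600Z 08008KT 9999 ////// 13/M11 Q1041".split():
    print(f, "=>", decode_field(f))
```

The ‘|’ inside a group means "either alternative", so (FEW|SCT|BKN) accepts any one of the three cloud codes.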

Weather (precipitation) information is the most complex of all

Qualifiers and weather phenomena below:

(no prefix) moderate
+ heavy
- light
VC in vicinity
MI shallow
PR partial
BC patches
DR low drifting
BL blowing
SH shower(s)
TS thunderstorm
FZ freezing
DZ drizzle
RA rain
SN snow
SG snow grains
IC ice crystals
PL ice pellets
GR hail
GS small hail
UP unknown precipitation
BR mist
FG fog
FU smoke
VA volcanic ash
DU widespread dust
SA sand
HZ haze
PY spray
PO dust/sand whirls
SQ squalls
FC tornado
SS sand/dust storm

e.g.,
VCFU = smoke in the vicinity
-SHRA = light rain showers
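
A weather group can be decoded by peeling off an optional ‘+’ or ‘-’ and then looking up the remaining letters two at a time. The sketch below uses the codes listed above; the function name decode_weather is hypothetical, and the result is simply the list of meanings in code order (not a polished English phrase). Anything that is not built entirely from known two-letter codes is rejected.

```python
import re

INTENSITY = {"+": "heavy", "-": "light"}  # no prefix means moderate

CODES = {
    "VC": "in vicinity", "MI": "shallow", "PR": "partial", "BC": "patches",
    "DR": "low drifting", "BL": "blowing", "SH": "shower(s)",
    "TS": "thunderstorm", "FZ": "freezing", "DZ": "drizzle", "RA": "rain",
    "SN": "snow", "SG": "snow grains", "IC": "ice crystals",
    "PL": "ice pellets", "GR": "hail", "GS": "small hail",
    "UP": "unknown precipitation", "BR": "mist", "FG": "fog", "FU": "smoke",
    "VA": "volcanic ash", "DU": "widespread dust", "SA": "sand", "HZ": "haze",
    "PY": "spray", "PO": "dust/sand whirls", "SQ": "squalls",
    "FC": "tornado", "SS": "sand/dust storm",
}

def decode_weather(group):
    """Return a list of meanings for a weather group, or None if it is not one."""
    # group 1: optional sign; group 2: one or more pairs of capital letters
    m = re.search("^([-+]?)(([A-Z][A-Z])+)$", group)
    if not m:
        return None
    sign, letters = m.group(1), m.group(2)
    pairs = [letters[i:i + 2] for i in range(0, len(letters), 2)]
    if any(p not in CODES for p in pairs):
        return None  # e.g. some other METAR field, not a weather group
    words = [CODES[p] for p in pairs]
    if sign:
        words.insert(0, INTENSITY[sign])
    return words

print(decode_weather("VCFU"))   # ['in vicinity', 'smoke']
print(decode_weather("-SHRA"))  # ['light', 'shower(s)', 'rain']
print(decode_weather("CAVOK"))  # None
```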

Practical: obtaining real-time raw weather data

http://www.aviationweather.gov/adds/dataserver_current/httpparam?dataSource=metars&requestType=retrieve&format=csv&stationString=RJOO&hoursBeforeNow=36&mostRecentForEachStation=constraint

METAR      Meaning
xxyyyyZ    day xx at yy:yy hours
xxxyyKT    wind xxx at yy kts
VRBxxKT    wind variable at xx kts
xxxVyyy    wind varying from xxx to yyy
xx/yy      temperature xx, dew point yy
Mxx/Myy    temperature -xx, dew point -yy
9999       visibility unlimited
xxxx       visibility xxxx m
CAVOK      ceiling/visibility OK
SKC        sky clear
CLR        sky clear below 12,000 feet
OVC        sky overcast
FEWxxx     few clouds at xxx00 feet
SCTxxx     scattered clouds at xxx00 feet
BKNxxx     broken clouds at xxx00 feet
Qxxxx      pressure xxxx mbar
Axxxx      pressure xx.xx inHg

Exercise

Write regular expressions that recognise each of the codes above. E.g., the expression for the wind code xxxyyKT is:
^([0-9][0-9][0-9])([0-9][0-9])KT$