ISIA – Data Storage (1)

Unstructured data

  • Many kinds, often linear
    • sound, image, video, text, …
    • sequence may be important, or not: /usr/share/dict/words
    • typical operation: search (e.g., by regular expression)
      • does data contain a particular pattern?
        re.search("GAATTC|CTTAAG", sequence)
  • Sequence may exist only to make search more efficient
    • ordering ⇒ binary search
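
The two operations above can be sketched as follows. This is only an illustration: the DNA fragment, the word list and the helper name `contains` are made up for the example. Note that `re.search` scans the whole string, whereas `re.match` only tests the start of it.

```python
import bisect
import re

# Hypothetical DNA fragment, made up for illustration.
sequence = "TTACGGAATTCGGA"

# Does the data contain either pattern anywhere?
found = re.search("GAATTC|CTTAAG", sequence)
print(found is not None)   # True
print(found.start())       # 5

# A sorted list (like /usr/share/dict/words) allows binary search.
words = ["apple", "banana", "cherry", "grape", "melon"]

def contains(sorted_words, target):
    """Binary search: find the insertion point, then compare."""
    i = bisect.bisect_left(sorted_words, target)
    return i < len(sorted_words) and sorted_words[i] == target

print(contains(words, "cherry"))   # True
print(contains(words, "kiwi"))     # False
```

Binary search needs about log2(n) comparisons instead of n, which is why keeping the data ordered can pay off even when the order itself carries no meaning.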

Hierarchical data

  • Larger or smaller units, often context-dependent
    • addresses, names (〒525-8577 滋賀県 草津市 野路東1丁目1-1 立命館大学)
    • dates, times (2018/05/22 15:16:42)
    • library catalogues (005.130.1)

Context-dependency implies a tree structure
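
One possible way to picture this tree is as nested dictionaries, where each level supplies the context for the level below it. The sketch below uses dates (year, month, day) as the hierarchy; the function name `insert` and the sample values are assumptions made for the example.

```python
def insert(tree, parts, value):
    """Store value under the nested keys listed in parts."""
    node = tree
    for part in parts[:-1]:
        # Descend one level, creating the branch if it does not exist yet.
        node = node.setdefault(part, {})
    node[parts[-1]] = value

tree = {}
insert(tree, ["2018", "05", "22"], "15:16:42")
insert(tree, ["2018", "05", "23"], "10:23:15")
insert(tree, ["2018", "06", "01"], "09:00:00")

print(sorted(tree["2018"]))       # ['05', '06']
print(tree["2018"]["05"]["22"])   # 15:16:42
```

The day "22" means nothing on its own; it only identifies a date in the context of its month and year, which is exactly the path from the root of the tree.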

Metadata

  • Information added to (or outside of) the actual data itself
    • describes characteristics of the actual data – format, meaning, origin, usage rights
      • information about the data retrieval
    • e.g., column names in CSV files: raw_text,station_id,observation_time,latitude,longitude
  • Can be used to index the data
    • track positions on a CD
    • meta ‘tags’ (ID3) in digital music files
    • chapter positions in DVD/Blu-Ray
    • file names in compressed zip archive
  • Sometimes included at the start or end of the data itself
    • books, CDs, etc., begin with table of contents
    • zip files end with list of the files in the archive
      • why at the end?
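
The zip case can be explored with Python's standard zipfile module. The sketch below builds a small archive in memory (the file names and contents are made up), lists the index, and checks where the index record sits. One plausible answer to the question above is that an index at the end lets the archive be written in a single pass, since the positions of the files are only known once they have all been written.

```python
import io
import zipfile

# Build a small archive in memory; names and contents are invented.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.txt", "first file")
    archive.writestr("b.txt", "second file")
data = buffer.getvalue()

# The list of names comes from the index, not from scanning the file data.
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    names = archive.namelist()
print(names)   # ['a.txt', 'b.txt']

# The "end of central directory" record begins with the bytes PK\x05\x06.
eocd = data.rfind(b"PK\x05\x06")
print(eocd > data.find(b"second file"))   # True: index follows the file data
```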

Data processing

  • Analysis of information, report generation, visualisation, etc.
    • performed in-memory
    • data structures often parallel the algorithms used
      • e.g., frequent insertion/removal in ordered data ⇒ binary tree

Dictionaries

  • Associate keys with values
    • key is metadata describing the data
    • value is the data itself
  • e.g., catalogues and directories

entry = {
	"name": "smith",
	"room": "AB007",
	"phone": "1234",
	"email": "foo@bar.example.com"
}

directory = [ entry1, entry2, entry3, ... ]

Key     Value
name    smith
room    AB007
phone   1234
email   foo@bar.example.com

But… how do we search for a particular entry?

Searchable data structures

  • Choose one key from the entries to use for searching
    • the associated values must be unique
  • Create another dictionary whose
    • keys are the values of your search key in the entries
    • values are the entries themselves
entries = [
	{ 'name': 'jack', 'room': 1, 'extension': 6666 },
	{ 'name': 'jill', 'room': 2, 'extension': 6642 },
	{ 'name': 'mickey', 'room': 6, 'extension': 6996 },
	{ 'name': 'minnie', 'room': 8, 'extension': 9669 }
]

# index the entries by room number
room = {}
for e in entries:
	room[e["room"]] = e

# look up information based on room number
print(room[6]) #=> {'name': 'mickey', 'room': 6, 'extension': 6996}
print(room[6]["name"]) #=> mickey

# index the entries by occupant name
name = {}
for e in entries:
	name[e["name"]] = e

# look up information based on occupant name
print(name["jill"]) #=> {'name': 'jill', 'room': 2, 'extension': 6642}
print(name["jill"]["room"]) #=> 2
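
The requirement that the search key's values be unique matters: if two entries share a value, the later one silently overwrites the earlier one in the index. A minimal sketch of an index builder that checks for this is shown below; the function name `build_index` is an assumption made for the example, not part of the lecture code.

```python
def build_index(entries, key):
    """Return a dictionary mapping entry[key] to the entry itself."""
    index = {}
    for entry in entries:
        value = entry[key]
        if value in index:
            # A repeated value would overwrite an earlier entry.
            raise ValueError(f"duplicate {key}: {value!r}")
        index[value] = entry
    return index

entries = [
    {"name": "jack", "room": 1, "extension": 6666},
    {"name": "jill", "room": 2, "extension": 6642},
]

by_name = build_index(entries, "name")
print(by_name["jill"]["extension"])   # 6642
```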

Data storage: databases

  • Generalizes the idea of dictionary, storing data in tables
    • two-dimensional array of data
    • rows are entries, columns are values
    • search ‘key’ can be any unique column value
      • or a combination of several values that make a unique identifier for each row
    • table searched by selecting rows
      • based on relationships between values and/or constants
  • Local or remote storage of structured data
    • communication over a network is typical
    • optimized for complex searches and multiple results
    • speed depends on server
    • often more powerful than data acquisition applications need
mysql> use directories;

mysql> select * from rooms;
+--------+------+-----------+
| name   | room | extension |
+--------+------+-----------+
| jack   | 1    | 6666      |
| jill   | 2    | 6642      |
| mickey | 6    | 6996      |
| minnie | 8    | 9669      |
+--------+------+-----------+
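
The same table can be tried out from Python without a database server. The sketch below swaps MySQL for SQLite (the sqlite3 module in the standard library) with an in-memory database, purely so that the example is self-contained; the query itself is an invented illustration of selecting rows by a relationship between a value and a constant.

```python
import sqlite3

# An in-memory SQLite database stands in for the MySQL server.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE rooms (name TEXT, room INTEGER, extension INTEGER)"
)
connection.executemany(
    "INSERT INTO rooms VALUES (?, ?, ?)",
    [("jack", 1, 6666), ("jill", 2, 6642),
     ("mickey", 6, 6996), ("minnie", 8, 9669)],
)

# Select rows based on a relationship between a column value and a constant.
rows = connection.execute(
    "SELECT name, extension FROM rooms WHERE room > ? ORDER BY room", (2,)
).fetchall()
print(rows)   # [('mickey', 6996), ('minnie', 9669)]

connection.close()
```

Unlike the dictionary approach, no separate index has to be built by hand for each search key, and one query can return several rows.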

Data storage: single file

  • One file logs all the data
    • e.g., logs of interesting events on your computer
      • take a look in /var/log/system.log (or /var/log/syslog)
    • optimized for speed of writing
      • open file, append message, close file
    • implicitly sorted by date
      • later entries are later in time
      • each item stamped by date, so we know when interesting things happened
    • logs of events become stale (events are no longer interesting)
      • logs rotated every day to compress older ones, delete very old ones
May 22 16:30:06 zora.local UserEventAgent[44]: Captive: en0: Maintaining 'Rits-Webauth'
May 22 16:30:06 zora.local acvpnagent[50]: A new network interface has been detected.
May 22 16:30:06 zora.local InternetSharing[5020]: en0, started "natpmpd"
May 22 16:30:06 zora.local InternetSharing[5020]: configd: com.apple.NetworkSharing.broadcast-2 has been started
May 22 16:30:06 zora.local mDNSResponder[99]: SetupDNSProxySkts: 63, 76, 107, 111
May 22 16:30:06 zora.local InternetSharing[5020]: dns proxy successfully enabled
May 22 16:30:06 zora.local UserEventAgent[44]: Captive: en0: Redirect detected on 'Rits-Webauth'
May 22 16:30:06 zora.local UserEventAgent[44]: Captive: CNPluginHandler en0: Authenticating (__BUILTIN__)
May 22 16:30:06 zora.local UserEventAgent[44]: Captive: CNPluginHandler en0: PresentingUI (__BUILTIN__)
May 22 16:30:06 zora kernel[0]: en0: BSSID changed to a0:cf:5b:c2:2d:8f
May 22 16:30:06 zora kernel[0]: en0: channel changed to 44
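
Each line of such a log has a regular shape: timestamp, host, process name, process ID, message. A rough sketch of splitting a line into those fields with a regular expression follows. The pattern is an assumption based only on the sample lines above and may not cover every line a real system log contains.

```python
import re

# timestamp, host, process[pid]: message
LINE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) ([^\[\s]+)\[(\d+)\]: (.*)$"
)

def parse(line):
    """Return the fields of one log line as a dictionary, or None."""
    match = LINE.match(line)
    if match is None:
        return None
    stamp, host, process, pid, message = match.groups()
    return {"stamp": stamp, "host": host, "process": process,
            "pid": int(pid), "message": message}

entry = parse("May 22 16:30:06 zora kernel[0]: en0: channel changed to 44")
print(entry["stamp"])     # May 22 16:30:06
print(entry["process"])   # kernel
print(entry["message"])   # en0: channel changed to 44
```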

Review: network communication

  • To identify a server on the Internet you need
    • an IP address (to identify the remote machine)
    • a port number (to identify a specific communication endpoint on that machine)
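
As a reminder of how the two pieces fit together, here is a small sketch in which a server and a client run in the same program and talk over the loopback address 127.0.0.1. The message text is invented, and port 0 is used so that the operating system picks any free port.

```python
import socket
import threading

# Port 0 asks the OS for any free port on the local machine.
server = socket.create_server(("127.0.0.1", 0))
address, port = server.getsockname()

def serve_once():
    """Accept one connection, send a greeting, then close it."""
    connection, _ = server.accept()
    with connection:
        connection.sendall(b"hello from port %d" % port)

thread = threading.Thread(target=serve_once)
thread.start()

# The client needs exactly two things: the IP address and the port number.
with socket.create_connection((address, port), timeout=5) as client:
    reply = client.makefile("rb").read()   # read until the server closes

thread.join()
server.close()
print(reply.decode())
```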

Review: data cleaning

  • Re-format data for readability
    • regular expressions
    • string manipulation
  • aString.split()
    • split a string into individual words
    • discards white space (blanks, newlines) between words
    • returns a list of strings
  • " ".join(listOfStrings)
    • joins a list of strings into a single string
    • uses the specified character (e.g., space " ") to separate adjacent elements

Review: data cleaning

Data:

Q:      What does it say on the bottom of Coke cans in Osaka?
A:      Open other end.

Cleaning:

" ".join(data.split()) # replace all white space with single spaces

Result:

Q: What does it say on the bottom of Coke cans in Osaka? A: Open other end.
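
The same cleaning can be done with a regular expression instead of split and join. The sketch below assumes the data is held in a string like the one shown above and checks that both approaches give the same result.

```python
import re

data = ("Q:      What does it say on the bottom of Coke cans in Osaka?\n"
        "A:      Open other end.\n")

# Replace every run of white space with one space, then trim the ends.
cleaned = re.sub(r"\s+", " ", data).strip()

print(cleaned)
print(cleaned == " ".join(data.split()))   # True
```

The strip() call is needed because re.sub turns the final newline into a trailing space, whereas split() discards leading and trailing white space automatically.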

Review: timestamps

To obtain an object representing the current date and time:

import datetime
now = datetime.datetime.now()

Convert to ISO standard date and time format:

now.isoformat() #=> 2018-05-23T10:23:15.056952

Convert to your own format:

now.strftime("%Y/%m/%d") #=> 2018/05/23
now.strftime("%H/%M/%S") #=> 10/23/15

(To see a full list of ‘%’ format characters for dates and times, type ‘man strftime’ in a terminal)
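
Going the other way, from text back to a date and time object, uses strptime with the same '%' format characters. A short sketch, using a fixed moment (the one shown above) so that the output is repeatable:

```python
import datetime

# A fixed moment instead of now(), so the results are the same every run.
moment = datetime.datetime(2018, 5, 23, 10, 23, 15, 56952)

text = moment.isoformat()
print(text)   # 2018-05-23T10:23:15.056952

# Parse the text back into a datetime object.
parsed = datetime.datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f")
print(parsed == moment)   # True

print(moment.strftime("%Y/%m/%d %H:%M:%S"))   # 2018/05/23 10:23:15
```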

Review: using files

with open(fileNameString, modeString) as file_content_variable:
	file_content_variable.write(aString)

modeString is one of:
  "r" – open the file for reading only
  "w" – open the file for writing and truncate it (remove any existing contents)
  "a" – open the file for writing and append to it (preserving existing content)

  • file_content_variable.write(aString)
    • writes aString to the file, without a newline
    • append a "\n" to aString if you need a newline
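
The difference between the three modes can be seen in a short experiment. The sketch below writes to a file in a temporary directory (the file name example.txt is made up) so that nothing important is overwritten.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as f:      # create (or truncate) the file
    f.write("first\n")
with open(path, "a") as f:      # keep existing content, add to the end
    f.write("second\n")
with open(path, "r") as f:
    after_append = f.read().splitlines()
print(after_append)             # ['first', 'second']

with open(path, "w") as f:      # "w" again: the old contents are removed
    f.write("third\n")
with open(path, "r") as f:
    after_truncate = f.read().splitlines()
print(after_truncate)           # ['third']
```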

Data storage: flat files

  • The name is meant to imply ‘not in a database’
  • Local or remote storage of structured data
    • entries split into many individual files
      • each file may or may not be appended to, like a log file
    • path name contains hierarchical context of each file
      • e.g., pets.blogs.com/wp/uploads/2018/05/22/kittens.jpg
    • optimized for string-based (file/directory name) hierarchical search
    • speed depends on file system
      • typically very fast, since searching directories is a critical OS service
    • search for and open a file for reading 1,000,000 times:
      ≈ 0.2 seconds (C)
      ≈ 1.9 seconds (Python)
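
A possible sketch of the idea: store each entry in a file whose path encodes its date, in the same year/month/day style as the upload path above. The function name `store`, the file name and the text are assumptions made for the example, and a temporary directory serves as the storage root.

```python
import datetime
import os
import tempfile

def store(root, moment, name, text):
    """Append text to root/YYYY/MM/DD/name, creating directories as needed."""
    directory = os.path.join(root, moment.strftime("%Y"),
                             moment.strftime("%m"), moment.strftime("%d"))
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "a") as f:
        f.write(text + "\n")
    return path

root = tempfile.mkdtemp()
path = store(root, datetime.datetime(2018, 5, 22, 15, 16, 42),
             "kittens.txt", "purr")

# The path itself answers "when was this stored?" without opening the file.
print(os.path.relpath(path, root).split(os.sep))
# ['2018', '05', '22', 'kittens.txt']
print(os.listdir(os.path.join(root, "2018", "05")))   # ['22']
```

Finding everything stored on a given day then becomes a directory listing, which the file system already does quickly.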

Practical (logging a random quote)

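import datetime
import time

import requests
from bs4 import BeautifulSoup

logfile = "logging-pull.txt"

while True:
	# fetch a page of random quotes and parse the HTML
	http_request = requests.get('http://www.quotationspage.com/random.php')
	soup = BeautifulSoup(http_request.text, "html.parser")
	# take the text of the first element with class "quote", on a single line
	qcontent = soup.find(class_="quote").get_text()
	data = qcontent.replace("\n", " ")
	# timestamp as MM/DD/YY - HH:MM:SS (portable equivalent of "%D - %T")
	stamp = datetime.datetime.now().strftime("%m/%d/%y - %H:%M:%S")
	# append one line per quote to the log file
	with open(logfile, "a") as file_content_variable:
		file_content_variable.write("{0}: {1}\n".format(stamp, data))
	print("Data logged.")
	time.sleep(15)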

The code above needs the following modules to be loaded: BeautifulSoup (from the bs4 package), requests, datetime, time