Extract Zotero Item Key from Local SQlite Database using Python

Background and Problem

Each data entry in the Zotero citation management system has a unique alphanumerical identifier, the Zotero Item Key. This key is needed, e.g., to provide in-text citations on WordPress sites using Zotpress, a Zotero powered plugin for managing citations and bibliography.

Unfortunately, there is no direct way to access the Zotero item key from the stand-alone desktop version or from the account page on the Zotero website. Zotpress lists citations with the corresponding key, but gets impossible to navigate with growing number of citations.

Although I encountered this issue recently, there are forum entries and blog posts dating back to 2012 with the desire too extract Zotero item keys (Give Me the Zotero Item Keys!, 2012; Zotero Item URIs from Client, 2013).

Quick Feasibility Study

Entering the Zotero item key as a search term in the desktop application reveals the corresponding bibliography entry. This tells me that the information is stored somewhere in the local database. Zotero’s documentation details information about the Zotero database and where to find it for different operating systems and Zotero versions (The Zotero Data Directory). The database name is given as zotero.sqlite, so the data is stored in SQlite format.

This seems all I need to obtain the item key on a local machine.

Data Extraction from Local SQlite Database Using Python

Python comes with many pre-installed modules, one of them being the sqlite3 for connecting to SQlite databases. Next step was to identify the database entry holding information about the item key, paper or book title, and author names. I used the multi-platform open source software DB Browser for SQLite to view the database content. The Zotero item key can be found in multiple entries, but one entry in particular seemed most useful, as it included all information needed for this project (item key, publication title, author and editor names, and other metadata) and the keys are needed for the Zotpress plugin, which relies on synched data: the table name syncCache. An example can be expanded below. The database cell is stored in JSON format and can be processed as dict format in Python.

Click to view database cell entry example

One example entry is as follows (item key, user id, and username changed):

{
    "key": "I2EA5679",
    "version": 23,
    "library": {
        "type": "user",
        "id": 1234567,
        "name": "yourusername",
        "links": {
            "alternate": {
                "href": "https://www.zotero.org/yourusername",
                "type": "text/html"
            }
        }
    },
    "links": {
        "self": {
            "href": "https://api.zotero.org/users/1234567/items/I2EA5679",
            "type": "application/json"
        },
        "alternate": {
            "href": "https://www.zotero.org/yourusername/items/I2EA5679",
            "type": "text/html"
        }
    },
    "meta": {
        "creatorSummary": "Helbing et al.",
        "parsedDate": "2019",
        "numChildren": 0
    },
    "data": {
        "key": "I2EA5679",
        "version": 23,
        "itemType": "bookSection",
        "title": "Will Democracy Survive Big Data and Artificial Intelligence?",
        "creators": [
            {
                "creatorType": "editor",
                "firstName": "Dirk",
                "lastName": "Helbing"
            },
            {
                "creatorType": "author",
                "firstName": "Dirk",
                "lastName": "Helbing"
            },
            {
                "creatorType": "author",
                "firstName": "Bruno S.",
                "lastName": "Frey"
            },
            {
                "creatorType": "author",
                "firstName": "Gerd",
                "lastName": "Gigerenzer"
            },
            {
                "creatorType": "author",
                "firstName": "Ernst",
                "lastName": "Hafen"
            },
            {
                "creatorType": "author",
                "firstName": "Michael",
                "lastName": "Hagner"
            },
            {
                "creatorType": "author",
                "firstName": "Yvonne",
                "lastName": "Hofstetter"
            },
            {
                "creatorType": "author",
                "firstName": "Jeroen",
                "lastName": "van den Hoven"
            },
            {
                "creatorType": "author",
                "firstName": "Roberto V.",
                "lastName": "Zicari"
            },
            {
                "creatorType": "author",
                "firstName": "Andrej",
                "lastName": "Zwitter"
            }
        ],
        "abstractNote": "",
        "bookTitle": "Towards Digital Enlightenment",
        "series": "",
        "seriesNumber": "",
        "volume": "",
        "numberOfVolumes": "",
        "edition": "",
        "place": "Cham",
        "publisher": "Springer International Publishing",
        "date": "2019",
        "pages": "73-98",
        "language": "en",
        "ISBN": "978-3-319-90868-7 978-3-319-90869-4",
        "shortTitle": "",
        "url": "http://link.springer.com/10.1007/978-3-319-90869-4_7",
        "accessDate": "1970-01-01",
        "archive": "",
        "archiveLocation": "",
        "libraryCatalog": "DOI.org (Crossref)",
        "callNumber": "",
        "rights": "",
        "extra": "DOI: 10.1007/978-3-319-90869-4_7",
        "tags": [],
        "collections": [],
        "relations": {},
        "dateAdded": "1970-01-01",
        "dateModified": "1970-01-01"
    }
}

There is a variation—and potentially others—depending on the data entry in the Zotero app. One relevant to this project is how author names are stored. If first and last names are stored independently, the keys for the dictionary entries are datacell["data"]["creators"]["firstName"] and datacell["data"]["creators"]["lastName"], respectively, whereas names stored in one cell can be retrieved by datacell["data"]["creators"]["name"].

(Assuming the database cell is loaded in JSON format using Python as variable datacell)

Assuming the database cell is loaded in JSON format using Python as variable datacell, the following are some interesting entries:

  • datacell["key"] to retrieve the item key
  • datacell["data"]["title"] to retrieve publication title
  • datacell["data"]["creators"] to retrieve list of authors and editors
  • datacell["meta"]["creatorSummary"] to retrieve reference style author mention

Project Framework

The goal of this project was to be able to retrieve a Zotero item key by either searching for part of a publication title or author name.

The database should be copied to a separate location. Avoid putting the Python script directly into the Zotero directory with the SQlite database or pointing the database connection to the Zotero home directory.

Target Dictionary Entries

This method uses the proper way of targeting dictionary keys and their value. As the database table cell is formatted in a way that can be parsed, any other attempts to extract information should be avoided (e.g., regular expressions on parsed data such as HTML, XML, an JSON, amongst others). The JSON formatted string needs to be read as a dictionary, which requires the json module. The connection to the database and the cursor to traverse it are opened the following way:

import sqlite3
import json

# non-fuzzy search of entire database cell entry
searchTerm = input("Enter search term (part of title, author, or item key): ").lower()

# standard database connection
conn = sqlite3.connect('/Users/username/yourPathToAwesomeProjects/zotero.sqlite')
cur = conn.cursor()

# access to database table "syncCache" for data retrieval
sCacheData = cur.execute("SELECT data FROM syncCache").fetchall()	# meta data retrieval, including item key, title, author name

The item key can be directly extracted by dict_entry.get("key"), others, such as publication title are in nested format and need two or three level extraction, e.g., dict_entry.get("data").get("creators") or dict_entry.get("data").get("creators").get("creatorType").

findFlag = 0
for tbl_entry in sCacheData:
	if searchTerm in tbl_entry[0].lower():	# crude text search of entire dictionary, including dict keys, not case sensitive
		findFlag = 1
		dict_entry = json.loads(tbl_entry[0])	# convert str to dict
		itemKey = dict_entry.get("key")
		titleString = dict_entry.get("data").get("title")	# paper title retrieval
		authorString = dict_entry.get("data").get("creators")	# fetch full list of authors and editors
		if authorString is not None and titleString != "":
			print("\n---\nPaper title:", titleString)
			for authorItem in authorString:
				foundAuthor = 0
				if authorItem.get("creatorType") == "author" and authorItem.get("firstName") is not None:
					print(authorItem.get("firstName"), authorItem.get("lastName"))
					foundAuthor = 1
					break
				elif authorItem.get("creatorType") == "author" and authorItem.get("name") is not None:
					print(authorItem.get("name"))
					foundAuthor = 1
					break
			if foundAuthor != 1:
					print(authorItem)	# fail-safe, last author or editor name
			print("Item key:", itemKey)
if not findFlag:
	print("Try a different term.")

The Zotero app allows to enter author and editor names in two patterns, (full name) and (first name, last name). The data is stored differently, based on the pattern and can be mixed (different for each author or editor) in one publication entry.

The final two lines close the cursor and connection to the database.

cur.close()
conn.close()

If the search term hits, the output for the example publication would be as follows:

---
Paper title: Will Democracy Survive Big Data and Artificial Intelligence?
First author: Dirk Helbing
Item key: I2EA5679

Conclusions

To avoid any unintended behavior with the Zotero desktop app, the SQlite database should be copied to a separate location together with the Python script.

When running the Python script, it will prompt the user for an input string. The search will not be case sensitive, but needs an exact match otherwise.

All resulting matches will be listed with publication title, one author, and the Zotero item key.

As the search is performed in a crude way on the whole synchCache table cell, words such as archive, attachment, or children will result in listing all publication entries. Incidentally, the search for the term all will also list all entries.