Creating log of played songs from multiple iTunes XML files

Now that I have the complete set of iTunes files in .xml form, I can parse them all and reconstruct the historical playlist by merging the results. I will store the overall playlist in a list called playlist, and the details of each track in a dictionary called master.

The data in each XML file needs to be converted to a dictionary that I can work with more readily. To parse the files I use the following import:

import xml.etree.ElementTree as ElementTree

I also need a few other imports and some constants:

import datetime, getpass, os
import dateutil.parser, dateutil.tz

username = getpass.getuser()

src_directory = "/Users/{}/Music/iTunes/Previous iTunes Libraries".format(username)

persistent_id = 'persistent_id'
track_id = 'track_id'
last_played_key = "play_date_utc"
play_count_key = "play_count"

tzutc = dateutil.tz.tzutc()

First I need to find the part of the XML file that contains the data I am interested in. The XML file uses the plist format, which is a hierarchical collection of dictionaries nested to whatever depth is needed:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Major Version</key><integer>1</integer>
    <key>Minor Version</key><integer>1</integer>
    <key>Application Version</key><string>9.1.1</string>
    <key>Features</key><integer>5</integer>
    <key>Show Content Ratings</key><true/>
    <key>Music Folder</key><string>file://localhost/C:/Users/myuser/Music/iTunes/iTunes%20Music/</string>
    <key>Library Persistent ID</key><string>63AA90F75B90877E</string>
    <key>Tracks</key>
    <dict>
        <key>1234</key>
        <dict>
            <key>Track ID</key><integer>1234</integer>
            ...
        </dict>
        <key>1235</key>
        <dict>
             <key>Track ID</key><integer>1235</integer>
             ...
        </dict>
    </dict>
    ...
</dict>
</plist>

I am interested in the <dict> tag immediately following <key>Tracks</key>, so I walk the children of the top-level <dict> until I find a key node with the text Tracks, then return the node that comes right after it:

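def find_tracks(root):
    # root is the <plist> element; root[0] is the top-level <dict>.
    tracks_is_next = False
    for node in root[0]:
        if tracks_is_next:
            # This is the <dict> that follows <key>Tracks</key>.
            return node
        if node.tag == "key" and node.text == "Tracks":
            tracks_is_next = True
    # No Tracks key was found in this file.
    return None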
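
As a quick check of this lookup, here is a minimal sketch that runs the same search against a tiny in-memory plist. The SAMPLE document is a hypothetical, heavily trimmed library file made up for this illustration, and find_tracks is repeated so that the snippet runs on its own:

```python
import xml.etree.ElementTree as ElementTree

# Hypothetical, heavily trimmed library file used only for this illustration.
SAMPLE = """<plist version="1.0">
<dict>
    <key>Major Version</key><integer>1</integer>
    <key>Tracks</key>
    <dict>
        <key>1234</key>
        <dict><key>Track ID</key><integer>1234</integer></dict>
        <key>1235</key>
        <dict><key>Track ID</key><integer>1235</integer></dict>
    </dict>
</dict>
</plist>"""

def find_tracks(root):
    """Return the node that follows <key>Tracks</key>, or None if absent."""
    tracks_is_next = False
    for node in root[0]:
        if tracks_is_next:
            return node
        if node.tag == "key" and node.text == "Tracks":
            tracks_is_next = True
    return None

root = ElementTree.fromstring(SAMPLE)   # root is the <plist> element
tracks = find_tracks(root)

# The children of the Tracks <dict> alternate between <key> and <dict>.
child_tags = [child.tag for child in tracks]
track_keys = [child.text for child in tracks if child.tag == "key"]
```

The alternating key/dict layout is what the rest of the parsing relies on: only the dict children carry track details, so the key children can be ignored at this level.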

The node returned is a sequence of key/dict pairs, repeated once for each item in my library. As an example, the XML entry for Track ID 15494 is:

<key>15494</key>
<dict>
    <key>Track ID</key><integer>15494</integer>
    <key>Name</key><string>I Can't Drive 55</string>
    <key>Artist</key><string>Sammy Hagar</string>
    <key>Album Artist</key><string>Sammy Hagar</string>
    <key>Album</key><string>Unboxed</string>
    <key>Genre</key><string>Rock</string>
    <key>Kind</key><string>Purchased AAC audio file</string>
    <key>Size</key><integer>8833599</integer>
    <key>Total Time</key><integer>252906</integer>
    <key>Disc Number</key><integer>1</integer>
    <key>Disc Count</key><integer>1</integer>
    <key>Track Number</key><integer>10</integer>
    <key>Track Count</key><integer>12</integer>
    <key>Year</key><integer>1994</integer>
    <key>Date Modified</key><date>2012-07-26T23:40:24Z</date>
    <key>Date Added</key><date>2010-09-13T07:16:58Z</date>
    <key>Bit Rate</key><integer>256</integer>
    <key>Sample Rate</key><integer>44100</integer>
    <key>Play Count</key><integer>18</integer>
    <key>Play Date</key><integer>3478580017</integer>
    <key>Play Date UTC</key><date>2014-03-25T00:13:37Z</date>
    <key>Skip Count</key><integer>1</integer>
    <key>Skip Date</key><date>2013-12-02T11:08:17Z</date>
    <key>Release Date</key><date>1994-03-15T08:00:00Z</date>
    <key>Rating</key><integer>100</integer>
    <key>Album Rating</key><integer>100</integer>
    <key>Album Rating Computed</key><true/>
    <key>Artwork Count</key><integer>1</integer>
    <key>Sort Album</key><string>Unboxed</string>
    <key>Sort Artist</key><string>Sammy Hagar</string>
    <key>Sort Name</key><string>I Can't Drive 55</string>
    <key>Persistent ID</key><string>B386C27DF668DC52</string>
    <key>Track Type</key><string>File</string>
    <key>Purchased</key><true/>
    <key>Location</key><string>file://localhost/Users/myuser/Music/ITLConversion/iTunes%20Media/Sammy%20Hagar/Unboxed/10%20I%20Can't%20Drive%2055.m4a</string>
    <key>File Folder Count</key><integer>4</integer>
    <key>Library Folder Count</key><integer>1</integer>
</dict>

I want to convert this to a set of key/value pairs. I normalize the existing keys by converting them to lower case and replacing each space with an underscore: Track ID becomes track_id, File Folder Count becomes file_folder_count, and so on. I cannot use track_id as the key in my overall data structure because these IDs change over time (from file to file), so instead I track songs by persistent_id (B386C27DF668DC52 in the XML above); these are unique for every track and stay the same even when the track_id changes (if you rename the song, for instance).
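
The key normalization can be tried on its own. This is a small sketch; the helper name normalize_key is my own for this illustration and does not appear elsewhere in the code:

```python
def normalize_key(text):
    """Lower-case a plist key and replace each space with an underscore."""
    return '_'.join(text.lower().split(' '))

# A few key names taken from the XML entry above.
normalized = [normalize_key(k) for k in ("Track ID", "File Folder Count", "Play Date UTC")]
```

Because the string is split on a single space, two consecutive spaces in a key would produce two underscores. None of the keys in the sample entry have that problem.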

To convert this XML entry, I only care about everything inside the <dict> tags. I want to run the following function on each of these entries:

def parse_track(node):
    details = {}
    key = None
    for element in node:
        if key:
            # The previous element was a <key>, so this element is its value.
            if element.tag == "integer":
                value = int(element.text)
            elif element.tag == "date":
                value = dateutil.parser.parse(element.text)
            else:
                # Strings keep their text; empty tags such as <true/> yield None.
                value = element.text
            details[key] = value
            key = None
        elif element.tag == "key":
            # Normalize the key: "Track ID" becomes "track_id".
            key = '_'.join(element.text.lower().split(' '))
        else:
            print("skipping", element.tag)
    return details

The function iterates over the child nodes of the node passed to it. When it sees a <key> tag, it stores the (normalized) name of the key. When it sees any other tag immediately after a key, it stores that element's value in the dictionary under the remembered key. It alternates in this way, remembering a key and then storing the value that follows it, until it has seen all the child nodes. It converts integers and dates to native types and keeps everything else as a string; empty tags such as <true/> have no text, so they are stored as None. I store the complete set of songs found in a dictionary called master. The track shown above has the Persistent ID B386C27DF668DC52 and is converted to the following dictionary:

{
    'rating': 100,
    'track_type': 'File',
    'bit_rate': 256,
    'purchased': None,
    'year': 1994,
    'artwork_count': 1,
    'sort_name': "I Can't Drive 55",
    'skip_date': datetime.datetime(2013, 12, 2, 11, 8, 17, tzinfo=tzutc()),
    'size': 8833599,
    'album': 'Unboxed',
    'album_rating_computed': None,
    'file_folder_count': 4,
    'track_count': 12,
    'track_id': 15494,
    'disc_number': 1,
    'location': "file://localhost/Users/myuser/Music/ITLConversion/iTunes%20Media/Sammy%20Hagar/Unboxed/10%20I%20Can't%20Drive%2055.m4a",
    'library_folder_count': 1,
    'sort_album': 'Unboxed',
    'total_time': 252906,
    'play_date': 3478580017,
    'date_modified': datetime.datetime(2012, 7, 26, 23, 40, 24, tzinfo=tzutc()),
    'play_count': 18,
    'genre': 'Rock',
    'date_added': datetime.datetime(2010, 9, 13, 7, 16, 58, tzinfo=tzutc()),
    'album_artist': 'Sammy Hagar',
    'name': "I Can't Drive 55",
    'kind': 'Purchased AAC audio file',
    'album_rating': 100,
    'disc_count': 1,
    'artist': 'Sammy Hagar',
    'release_date': datetime.datetime(1994, 3, 15, 8, 0, tzinfo=tzutc()),
    'play_date_utc': datetime.datetime(2014, 3, 25, 0, 13, 37, tzinfo=tzutc()),
    'persistent_id': 'B386C27DF668DC52',
    'sort_artist': 'Sammy Hagar',
    'track_number': 10,
    'sample_rate': 44100,
    'skip_count': 1
}
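
For comparison, here is a sketch of an alternative way to walk the key/value pairs, two children at a time. This is not the function used above: parse_track_pairs is a hypothetical name, it assumes the children strictly alternate between a key and a value, it parses dates with the standard library (giving naive datetimes rather than the timezone-aware ones dateutil returns), and it maps <true/> and <false/> to booleans instead of None:

```python
import datetime
import xml.etree.ElementTree as ElementTree

# A shortened copy of the track entry shown earlier.
SAMPLE = """<dict>
    <key>Track ID</key><integer>15494</integer>
    <key>Name</key><string>I Can't Drive 55</string>
    <key>Play Date UTC</key><date>2014-03-25T00:13:37Z</date>
    <key>Purchased</key><true/>
</dict>"""

def parse_track_pairs(node):
    """Hypothetical variant of parse_track that reads children in pairs."""
    details = {}
    children = iter(node)
    # zip over the same iterator twice yields (key_node, value_node) pairs.
    for key_node, value_node in zip(children, children):
        key = '_'.join(key_node.text.lower().split(' '))
        if value_node.tag == "integer":
            value = int(value_node.text)
        elif value_node.tag == "date":
            # Assumes the fixed "YYYY-MM-DDTHH:MM:SSZ" form seen in the sample.
            value = datetime.datetime.strptime(value_node.text, "%Y-%m-%dT%H:%M:%SZ")
        elif value_node.tag in ("true", "false"):
            value = (value_node.tag == "true")
        else:
            value = value_node.text
        details[key] = value
    return details

track = parse_track_pairs(ElementTree.fromstring(SAMPLE))
```

The pairwise version is shorter, but unlike the original it cannot report a stray element that arrives without a key, which is what the "skipping" message is for.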

I use a list called playlist to keep a record of every song played and the associated time it was played. Each item in the list is a tuple containing the timestamp and the persistent ID. As I load each library file I determine the last time each track was played and compare it to the last known time it was played; if they do not match, then I add a tuple to playlist for that track (if they match, then I already know it was played at that time and don’t need to add it).

One special requirement I have is inserting 'estimated' play_date values when a track has been played more than once between backup files. As an example, if a track has been played 3 times between the last XML file and the current one, I want to use the last_played date as one of those three and evenly space two dates between the previously known last date and the new one. This function performs that task, taking the overall playlist, the number of plays to insert, the key, and the previous and current dictionaries for the track in question:

def insert_missing(playlist, count, key, previous, track):
    print("missing:", count, track["name"], previous[last_played_key], track[last_played_key])
    last_played = previous[last_played_key]
    this_played = track[last_played_key]
    # Split the gap between the two known plays into count + 1 equal intervals.
    delta = this_played - last_played
    interim_delta = delta / (count + 1)
    for interim in range(1, count + 1):
        played_date = last_played + interim_delta * interim
        # Truncate each estimate to midnight UTC.
        played_date = datetime.datetime.combine(played_date.date(), datetime.datetime.min.time())
        played_date = played_date.replace(tzinfo=tzutc)
        playlist.append((played_date, key))
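
To make the spacing concrete, here is a worked sketch of the same arithmetic with made-up dates. The function name estimated_dates is hypothetical, and it uses the standard library's datetime.timezone.utc in place of dateutil so that it runs without extra packages:

```python
import datetime

UTC = datetime.timezone.utc

def estimated_dates(last_played, this_played, count):
    """Return `count` evenly spaced midnight-UTC estimates between two plays."""
    step = (this_played - last_played) / (count + 1)
    dates = []
    for i in range(1, count + 1):
        estimate = last_played + step * i
        # Truncate to midnight, as insert_missing does.
        midnight = datetime.datetime.combine(estimate.date(), datetime.time.min)
        dates.append(midnight.replace(tzinfo=UTC))
    return dates

# Made-up example: 30 days between the two known plays, 2 plays to fill in.
previous = datetime.datetime(2014, 1, 1, 9, 30, tzinfo=UTC)
current = datetime.datetime(2014, 1, 31, 9, 30, tzinfo=UTC)
dates = estimated_dates(previous, current, 2)
```

With a 30 day gap and two plays to insert, the step is 10 days, so the estimates land on 11 January and 21 January at midnight UTC.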

That's all the plumbing necessary to parse a file. The whole thing is assembled into one function that parses each file passed to it and merges all the data structures with updates and changes as each file is read. I iterate over all tracks in a file and extract the tracks and the playlist additions for that file:

def read_file(filename, master, playlist):
    print("parsing {}".format(filename))
    file_db = {}
    tree = ElementTree.parse(os.path.join(src_directory, filename))
    root = tree.getroot()
    tracks = find_tracks(root)
    # Index every track in this file by its persistent ID.
    for element in tracks:
        if element.tag == "dict":
            track = parse_track(element)
            file_db[track[persistent_id]] = track
    # Merge this file into master, logging any plays not seen before.
    for key, track in file_db.items():
        if last_played_key in track:
            last_played = track[last_played_key]
            if key in master:
                previous = master[key]
                # Also true when the track had never been played before.
                if previous.get(last_played_key) != last_played:
                    if (last_played_key in previous and play_count_key in previous
                            and play_count_key in track):
                        num_plays = track[play_count_key] - previous[play_count_key]
                        if num_plays > 1:
                            insert_missing(playlist, num_plays - 1, key, previous, track)
                    playlist.append((last_played, key))
            else:
                playlist.append((last_played, key))
        master[key] = track
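
The merge step is easier to follow without the file handling. Below is a simplified sketch that merges already-parsed snapshots held in memory. The name merge_snapshot, the IDs AAAA and BBBB, and the dates are all made up for this illustration, and the sketch leaves out the estimated plays entirely:

```python
import datetime

LAST_PLAYED = "play_date_utc"  # same value the text stores in last_played_key

def merge_snapshot(snapshot, master, playlist):
    """Hypothetical, file-free sketch of the merge loop (no estimated plays)."""
    for key, track in snapshot.items():
        if LAST_PLAYED in track:
            last_played = track[LAST_PLAYED]
            previous = master.get(key, {})
            # Log a play only when the timestamp differs from the last known one.
            if previous.get(LAST_PLAYED) != last_played:
                playlist.append((last_played, key))
        master[key] = track

d = datetime.datetime
first = {"AAAA": {"name": "Song A", LAST_PLAYED: d(2014, 1, 1)},
         "BBBB": {"name": "Song B"}}
second = {"AAAA": {"name": "Song A", LAST_PLAYED: d(2014, 2, 1)},
          "BBBB": {"name": "Song B", LAST_PLAYED: d(2014, 1, 15)}}

master, playlist = {}, []
for snapshot in (first, second, second):   # the repeated snapshot adds nothing
    merge_snapshot(snapshot, master, playlist)
```

Merging the same snapshot twice leaves the playlist unchanged, which is the property that lets overlapping backups be processed without producing duplicate entries.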

Now I have all the functions I need to iterate over all the files in the Previous iTunes Libraries directory:

files = os.listdir(src_directory)
# Process the files in name order; this assumes the file names sort chronologically.
files = sorted(x for x in files if x.endswith('.xml') and not x.startswith('.'))

playlist = []
master = {}
for file in files:
    read_file(file, master, playlist)
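
The file filter can be checked against a made-up directory listing. The file names below are hypothetical; the sketch assumes each backup carries an ISO-style date in its name, in which case sorting by name also sorts by date:

```python
# Hypothetical directory listing; real names depend on how the backups were exported.
listing = [
    "iTunes Library 2014-03-25.xml",
    ".DS_Store",
    "iTunes Library 2013-12-02.xml",
    "._iTunes Library 2014-03-25.xml",   # hidden file, skipped
    "iTunes Library 2013-12-02.itl",     # not XML, skipped
]

xml_files = sorted(x for x in listing if x.endswith('.xml') and not x.startswith('.'))
```

Order matters here because the merge compares each file against what has been seen so far, so the files should be read from oldest to newest.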

When I’m done with this, I can sort and print the playlist:

playlist = sorted(playlist)
for played, key in playlist:
    details = master[key]
    print(played, key, details['name'])
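
Sorting works without a key function because Python compares tuples element by element, so the entries are ordered by timestamp first and by persistent ID when two timestamps are equal. A small sketch with made-up entries:

```python
import datetime

# Made-up (timestamp, persistent ID) entries in arbitrary order.
playlist = [
    (datetime.datetime(2014, 3, 25, 0, 13, 37), "BBBB"),
    (datetime.datetime(2013, 12, 2, 11, 8, 17), "AAAA"),
    (datetime.datetime(2014, 3, 25, 0, 13, 37), "AAAA"),
]

ordered = sorted(playlist)
lines = ["{} {}".format(played.isoformat(), key) for played, key in ordered]
```

One caveat: Python refuses to order a timezone-aware datetime against a naive one, so every timestamp in the list needs to be of the same kind. In the code above they are all timezone-aware UTC values.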

That's it! I now have a complete playlist for the full history of all my iTunes backups.