Diffing Giant Sorted Files in Python

Today I ran across an interesting problem: I need to find the differences between two “extremely large” .csv files. The files are database snapshots from different points in time, and I am looking for “upserts” (lines that are new or changed) and ignoring the rest. For the purposes of this post, deletes are handled separately.

Initial Attempt

My first attempt used Python’s built-in package for this purpose: difflib. The resulting program is actually quite elegant:

import csv
import difflib


def load(name):
    with open(name) as fin:
        return fin.readlines()


def get_upserts(f1, f2):
    # keep only lines marked as new/changed ('+ ') and strip the prefix
    return (row[2:] for row in difflib.Differ().compare(f1, f2) if row.startswith('+ '))


def main(f1_name='previous.csv', f2_name='current.csv'):
    f1 = load(f1_name)
    f2 = load(f2_name)
    reader = csv.reader(get_upserts(f1, f2))
    for row in reader:
        print(row)


if __name__ == '__main__':
    main()

This program loads both files into memory (load() returns the entire file as a list of strings) and then get_upserts() uses the Differ.compare(f1, f2) method in difflib to get the comparative delta. Notice that I only care about lines that start with '+ ', as this is the indication in the output of compare() that the line is new or changed in the second file.

Note that in this approach, only the lines of interest are ever parsed by csv. No parsing work happens until I know a line matters; this slight performance boost is nice.
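To make the filtering concrete, here is what compare() emits for a couple of toy inputs (made-up rows; difflib prefixes every output line with '  ', '- ', '+ ', or '? '):

import difflib

before = ['1|batman|1939\n', '2|superman|1938\n']
after = ['1|batman|1939\n', '3|wonder woman|1941\n']

for line in difflib.Differ().compare(before, after):
    print(repr(line))

# '  1|batman|1939\n'        -> unchanged, present in both files
# '- 2|superman|1938\n'      -> only in the first file
# '+ 3|wonder woman|1941\n'  -> only in the second file

get_upserts() keeps only the '+ ' lines and strips the two-character prefix before csv.reader ever sees them.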

The Problem

difflib expects the full content of each file to be in RAM so that it can use more context than just the current line when deciding what is new and what is the same. However, in my case the files are too big for RAM (several GB each). Additionally, difflib’s delta algorithm is quite slow because of its extensive context matching between lines.

Assumptions

I have three assumptions that drive the design of this:

  • files are “large” (too big to fit in RAM)
  • each row is in a form that the csv package can interpret
  • file contents are ordered

That last assumption is huge. difflib has to derive what “ordered” means in the files it’s comparing, so it needs the surrounding lines, comparing chunks of a file at a time, to decide whether any given line is new, removed, modified, or unchanged. But if the lines of the files are ordered, I can make all my comparisons without using any information outside of the current line of interest from each file.
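As a minimal sketch of the idea, here is the same single-pass comparison on two plain (made-up) sorted lists of IDs:

before_ids = [1, 2, 3, 5]
after_ids = [1, 3, 4, 5, 6]

i = j = 0
upsert_ids = []
while i < len(before_ids) and j < len(after_ids):
    if before_ids[i] == after_ids[j]:
        i += 1  # same record in both files, move on
        j += 1
    elif before_ids[i] < after_ids[j]:
        i += 1  # only on the left: deleted, ignore
    else:
        upsert_ids.append(after_ids[j])  # only on the right: upsert
        j += 1
upsert_ids.extend(after_ids[j:])  # anything left on the right is new

print(upsert_ids)  # [4, 6]

Each list is walked exactly once, so the comparison needs only one row from each file at any moment.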

The Data

Files are CSV. Each row starts with an ID column, followed by the rest of the fields in a consistent order. The delimiter is a pipe (|) rather than the standard comma.

Example:

id|name|date_of_creation
1|batman|1939
2|superman|1938
3|wonder woman|1941
...
100|the thing|1961
...

I include the row with ID=100 to point out that the files are sorted by id, but the lines are not sorted alphanumerically (if they were, the row for “the thing” would be right after the row for “batman”). This will matter later.
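A quick way to see the difference (hypothetical IDs as strings):

ids = ['1', '100', '2', '3']

print(sorted(ids))           # ['1', '100', '2', '3'] -- alphanumeric order
print(sorted(ids, key=int))  # ['1', '2', '3', '100'] -- numeric order

This is why the reader below converts the ID column to an int before any comparison happens.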

An Algorithm

If a row is only in the left file, then it was deleted or changed. I can ignore these for my purposes. If a row is only in the right file, then it is either added or changed. These are the rows I care about.

The basic approach will look like this:

  1. open before and after files and consume the header row(s)
  2. read the first row from each file and normalize
    1. determine the numeric ID from the first column
    2. emit the full list of fields in the row
  3. while both files still have rows:
    1. compare the row of interest from each file and:
      1. if the same, read and normalize the next line from each file
      2. if the ID of the left row is less than the ID of the right row, read and normalize the next line from before
      3. else emit the right row as an upsert and read and normalize the next line from after
  4. if there are still rows in the right file:
    1. emit all remaining rows from the right file

Implementation

Reading Data

First let’s rewrite the load() method to be a generator that emits one item each time next() is called:

def file_reader(name, delimiter=',', has_header=False):
    with open(name) as fin:
        reader = csv.reader(fin, delimiter=delimiter)
        if has_header:
            next(reader)  # discard the header row
        for row in reader:
            row[0] = int(row[0])  # IDs must compare numerically, not as strings
            yield row

I’ve renamed the method as it no longer returns the full contents. I’ve also added a delimiter parameter so that I can pass in an alternate delimiter if desired, and a has_header parameter to indicate whether the first line should be ignored.

This method is now a generator. If you don’t know what generators are, you should learn. For the purposes of this post, it simply means the function emits one row at a time as it is iterated over, rather than returning the full output in one shot. The file is now only partially in memory at any point in time.
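For example, reading the pipe-delimited data shown earlier (file name assumed):

reader = file_reader('current.csv', delimiter='|', has_header=True)

print(next(reader))  # [1, 'batman', '1939']
print(next(reader))  # [2, 'superman', '1938']

Note that the ID arrives already converted to an int; the remaining fields stay as strings.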

Finding Differences

We use IDs to know whether the rows we are comparing represent the same record. We can tell where we are in each file by comparing the first element: if the IDs match, we are looking at the same entity; if either file’s ID is lower, that file’s iterator needs to be advanced. Any time we advance the after-side iterator, we also need to emit a row of output:

def get_upserts(reader_left, reader_right):
    # use None as a sentinel for "reader exhausted"; a bare next() raising
    # StopIteration inside a generator becomes a RuntimeError in Python 3.7+
    left = next(reader_left, None)
    right = next(reader_right, None)
    while left is not None and right is not None:
        if left == right:
            left = next(reader_left, None)
            right = next(reader_right, None)
        elif left[0] < right[0]:
            left = next(reader_left, None)
        else:
            yield right
            right = next(reader_right, None)
    while right is not None:
        yield right
        right = next(reader_right, None)

We can compare full rows in the first if statement because rows with identical fields compare equal. In the elif case, we are identifying lines that are on the left but not the right and ignoring those. The else case catches all other instances, which are:

  • item is on the right only
  • item is on the left and right but different on the right

… and yields those rows to the caller.

Finally, once the left reader is exhausted, we emit any remaining rows on the right.
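A quick in-memory check of the behavior (made-up rows, already normalized the way file_reader() would produce them):

before = iter([
    [1, 'batman', '1939'],
    [2, 'superman', '1938'],
    [3, 'wonder woman', '1941'],
])
after = iter([
    [1, 'batman', '1939'],
    [2, 'superman', '1940'],   # changed on the right
    [3, 'wonder woman', '1941'],
    [4, 'the flash', '1940'],  # new on the right
])

print(list(get_upserts(before, after)))
# [[2, 'superman', '1940'], [4, 'the flash', '1940']]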

Calling

Now the main function looks like this:

def main(f1_name='previous.csv', f2_name='current.csv'):
    f1 = file_reader(f1_name, delimiter='|', has_header=True)
    f2 = file_reader(f2_name, delimiter='|', has_header=True)
    for row in get_upserts(f1, f2):
        print(row)


if __name__ == '__main__':
    main()
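If the upserts should land in another pipe-delimited file instead of stdout, a small variation works; write_upserts() here is a hypothetical helper, not part of the original program:

import csv


def write_upserts(f1_name, f2_name, out_name):
    f1 = file_reader(f1_name, delimiter='|', has_header=True)
    f2 = file_reader(f2_name, delimiter='|', has_header=True)
    with open(out_name, 'w', newline='') as fout:
        writer = csv.writer(fout, delimiter='|')
        for row in get_upserts(f1, f2):
            writer.writerow(row)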
