Today I ran across an interesting problem: I need to find the differences between two “extremely large”
.csv files. The files are database snapshots from different points in time, and I am looking for “upserts” (lines that are new or changed) and ignoring the rest. For the purposes of this post, deletes are handled separately.
My first attempt was Python’s built-in package for this purpose:
difflib. This package is actually quite elegant:
```python
import csv
import difflib


def load(name):
    with open(name) as fin:
        return fin.readlines()


def get_upserts(f1, f2):
    return (row[2:] for row in difflib.Differ().compare(f1, f2)
            if row.startswith('+ '))


def main(f1_name='previous.csv', f2_name='current.csv'):
    f1 = load(f1_name)
    f2 = load(f2_name)
    reader = csv.reader(get_upserts(f1, f2))
    for row in reader:
        print(row)


if __name__ == '__main__':
    main()
```
This program loads both files into memory (
load() returns the entire file as a list of strings) and then
get_upserts() uses the
Differ.compare(f1, f2) method in
difflib to get the comparative delta. Notice that I only care about lines that start with
'+ ', as this is the indication in the output of
compare() that the line is new or changed in the second file.
Note that in this approach, I only need to convert lines of interest using
csv. This is because no processing is necessary until I know the line matters; this slight performance boost is nice.
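To see what those markers look like, here's a tiny, hypothetical example using two in-memory snapshots rather than real files (the data is made up for illustration):

```python
import difflib

before = ['1|batman|1939\n', '2|superman|1938\n']
after = ['1|batman|1939\n', '2|superman|1940\n', '3|wonder woman|1941\n']

# compare() prefixes each line with '  ' (unchanged), '- ' (only in
# before), '+ ' (only in after), or '? ' (intra-line difference hints)
delta = list(difflib.Differ().compare(before, after))
upserts = [line[2:] for line in delta if line.startswith('+ ')]
print(upserts)
```

The `'+ '` lines capture both the modified superman row and the brand-new wonder woman row, which is exactly the "upsert" set.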
difflib expects the full content of each file to be in RAM so that it can use more context than just the current line to decide what is new vs. what is the same. However, in my case the files are too big for RAM (several GB each). Additionally,
difflib’s algorithm for finding deltas is quite slow because of the use of extensive context matching between lines.
I have three assumptions that drive the design of this:
- files are “large” (too big to fit in RAM)
- each row is in a form that the csv package can interpret
- file contents are ordered
That last assumption is huge.
difflib has to derive what “ordered” means in the files it’s comparing, and to do so it needs to know other lines around it to compare chunks of a file at a time to understand whether any line is new, removed, modified, or unchanged. But if the lines of the file are ordered, I can actually make all my comparisons without using any information outside of the current line of interest from each file.
Files are CSV. Each row in the csv starts with an ID column and is followed by the rest of the fields in a consistent order. The delimiter is a pipe (|) rather than the standard comma.
```
id|name|date_of_creation
1|batman|1939
2|superman|1938
3|wonder woman|1941
...
100|the thing|1961
...
```
I include the row with ID=100 to point out that the files are sorted by ID numerically, not by ID as a string (if IDs were compared as strings, "100" would sort between "1" and "2", putting the row for "the thing" right after the row for "batman"). This will matter later.
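A quick illustration of the difference, comparing the same IDs as strings versus as integers:

```python
ids = ['1', '2', '3', '100']

# string comparison goes character by character, so '100' sorts before '2'
print(sorted(ids))

# integer comparison gives the order the files are actually in
print(sorted(int(i) for i in ids))
```

This is why the reader will need to convert the ID column to an integer before any comparisons happen.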
If a row is only in the left file, then it was deleted or changed. I can ignore these for my purposes. If a row is only in the right file, then it is either added or changed. These are the rows I care about.
The basic approach will look like this:
- open the before and after files and consume the header row(s)
- read the first row from each file and normalize:
  - determine the numeric ID from the first column
  - emit a tuple that is the full list of fields in the row
- while both files still have rows:
  - compare the row of interest from each file and:
    - if the rows are the same, read and normalize the next line from each file
    - if the ID of the left row is less than the ID of the right row, read and normalize the next line from the left file
    - else emit the right row as an upsert and read and normalize the next line from the right file
- if there are still rows in the right file:
  - emit all remaining rows from the right file
First let’s rewrite the
load() method to be a generator that emits one item each time
next() is called:
```python
def file_reader(name, delimiter=',', has_header=False):
    with open(name) as fin:
        reader = csv.reader(fin, delimiter=delimiter)
        if has_header:
            next(reader)  # discard the header row
        for row in reader:
            # the ID must compare numerically, not lexicographically
            row[0] = int(row[0])
            yield row
```
I’ve renamed the method as it no longer returns the full contents. I’ve also added a
delimiter parameter so that I can pass in an alternate delimiter if desired, and a
has_header parameter to indicate whether the first line should be ignored.
This method is now a generator. If you don't know what generators are, you should learn. For the purposes of this post, it means the function emits one row of output at a time as it is iterated over, rather than returning the fully-developed output in one shot. Only a small part of the file is in memory at any point in time.
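As a minimal illustration (a made-up generator, not part of the diff code), each next() call runs the function body just far enough to produce one value:

```python
def countdown(n):
    # nothing here executes until the first next() call
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
print(next(gen))   # first value
print(next(gen))   # second value
print(list(gen))   # whatever remains
```

file_reader() works the same way: each next() pulls exactly one parsed row off the file.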
We use IDs to know whether the rows we are comparing refer to the same record. We can tell where we are in each file relative to the other by comparing the first element: if it's the same, we're looking at the same entity; if either file's ID is lower, that file's iterator needs to be advanced. Any time we advance the after-side iterator, we also need to emit a row of output:
```python
def get_upserts(reader_left, reader_right):
    # next(reader, None) returns None when a file is exhausted
    # instead of raising StopIteration
    left = next(reader_left, None)
    right = next(reader_right, None)
    while left is not None and right is not None:
        if left == right:
            left = next(reader_left, None)
            right = next(reader_right, None)
        elif left < right:
            left = next(reader_left, None)
        else:
            yield right
            right = next(reader_right, None)
    while right is not None:
        yield right
        right = next(reader_right, None)
```
We can compare full rows in the first
if statement because identical records parse to identical rows. In the
elif case, we are identifying lines that are on the left but not the right and ignoring those. The
else case catches all other instances, which are:
- item is on the right only
- item is on the left and right but different on the right
yielding those rows to the caller.
Finally, once left is empty, we emit the remaining rows on the right.
Now the main function looks like this:
```python
def main(f1_name='previous.csv', f2_name='current.csv'):
    f1 = file_reader(f1_name, delimiter='|', has_header=True)
    f2 = file_reader(f2_name, delimiter='|', has_header=True)
    for row in get_upserts(f1, f2):
        print(row)
```
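Putting it all together, here's a self-contained sanity check. It writes two tiny pipe-delimited snapshots (made-up data) to a temporary directory and runs the reader and diff over them end to end:

```python
import csv
import os
import tempfile

def file_reader(name, delimiter=',', has_header=False):
    with open(name) as fin:
        reader = csv.reader(fin, delimiter=delimiter)
        if has_header:
            next(reader)  # discard the header row
        for row in reader:
            row[0] = int(row[0])  # IDs compare numerically
            yield row

def get_upserts(reader_left, reader_right):
    # next(reader, None) yields None once a file is exhausted
    left = next(reader_left, None)
    right = next(reader_right, None)
    while left is not None and right is not None:
        if left == right:
            left = next(reader_left, None)
            right = next(reader_right, None)
        elif left < right:
            left = next(reader_left, None)
        else:
            yield right
            right = next(reader_right, None)
    while right is not None:
        yield right
        right = next(reader_right, None)

with tempfile.TemporaryDirectory() as tmp:
    prev = os.path.join(tmp, 'previous.csv')
    curr = os.path.join(tmp, 'current.csv')
    with open(prev, 'w') as f:
        f.write('id|name|date_of_creation\n'
                '1|batman|1939\n'
                '2|superman|1938\n')
    with open(curr, 'w') as f:
        f.write('id|name|date_of_creation\n'
                '1|batman|1939\n'
                '2|superman|1940\n'   # changed
                '3|wonder woman|1941\n')  # added
    f1 = file_reader(prev, delimiter='|', has_header=True)
    f2 = file_reader(curr, delimiter='|', has_header=True)
    upserts = list(get_upserts(f1, f2))
    print(upserts)
```

The changed superman row and the new wonder woman row come out as upserts; the unchanged batman row is skipped without ever holding more than one row per file in memory.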