How to insert data in bulk in MySQL using Perl ?

A task I had to face recently was to upload tens millions of records in a MySQL table - as fast as possible: - reducing the wallclock and CPU time to a minimum - and more important, optimizing the MySQL downtime

There are many solutions to this problem, but few of them are really interesting in terms of time.

Let's play a bit with some of them.

As we need some test data, let's try to work with IEEE OUI numbers which are used for instance to identify the vendors of a NIC based on its MAC Address. There is a dataset of more than ten thousands entries there. It's not a lot, but it enough to show some of the basic concepts.

Getting the data

The first step is to get the OUI list - the list has currently about 10 000 vendors, but continuously increases.


cd ~
mkdir tmp
wget http://standards.ieee.org/regauth/oui/oui.txt

A look at the beginning of the file shows its structure:


OUI                           Organization
company_id                  Organization
Address

00-00-00   (hex)            XEROX CORPORATION
000000     (base 16)        XEROX CORPORATION
M/S 105-50C
800 PHILLIPS ROAD
WEBSTER NY 14580
UNITED STATES

Preparing the DB structure

A reasonable assumption is that we want to store: - the 3-octets prefix as a string. This field will be indexed (it should probably be 'unique' as well, but I'm afraid the IEEE database has duplicates there) - the vendor name as a string

This translates quite easily in MySQL:


mysql> create database tmp;
Query OK, 1 row affected (0.00 sec)

mysql> use tmp;
Database changed

mysql> create table oui (id int(11) primary key auto_increment, oui char(6) not null, vendor varchar(255) not null);
Query OK, 0 rows affected (0.01 sec)

mysql> create index idx_oui_oui on oui (oui);
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> grant all privileges on tmp.oui to 'test_perl'@'localhost';
Query OK, 0 rows affected (0.07 sec)

Naive approach: INSERT

The naive approach is the following: - parse the data - look for a new OUI number - when found, prepare the statement - and insert the value in the DB

Translated in Perl code, the result is the following:


#!/usr/bin/perl -w

use strict;

use DBI;

my $line;
my $dbh = DBI->connect("DBI:mysql:database=tmp", "test_perl");

$dbh->do ("TRUNCATE TABLE oui");
while ($line = <>)
{
    next unless $line =~ /^([0-F]{6})\s+\(base 16\)\s+(.*)$/;
    my ($oui, $vendor) = ($1, $2);
    my $sth = $dbh->prepare ("INSERT INTO oui(oui, vendor) VALUES (?, ?)") or die "Cannot prepare $!";
    $sth->execute ($oui, $vendor) or die "Cannot insert $!";
}

This runs in consistently 9.9 seconds of real time on my server:


mac@dax:~/tmp$ time ./test.pl oui.txt

real    0m9.943s
user    0m4.935s
sys     0m0.437s

Using prepared statements for what they're designed

The 'prepared' statements are exactly what they sound like: statements which get first prepared, then executed. The actual preparations has to be done one single time - and the statement can be executed many times... Let's try to factorize the statement preparation:


#!/usr/bin/perl -w

use strict;

use DBI;

my $line;
my $dbh = DBI->connect("DBI:mysql:database=tmp", "test_perl");

$dbh->do ("TRUNCATE TABLE oui");
my $sth = $dbh->prepare ("INSERT INTO oui(oui, vendor) VALUES (?, ?)") or die "Cannot prepare $!";
while ($line = <>)
{
    next unless $line =~ /^([0-F]{6})\s+\(base 16\)\s+(.*)$/;
    my ($oui, $vendor) = ($1, $2);
    $sth->execute ($oui, $vendor) or die "Cannot insert $!";
}

Bingo! We're down to an average of 7 seconds...


mac@dax:~/tmp$ time ./test.pl oui.txt

real    0m6.811s
user    0m2.406s
sys     0m0.355s

But there's much faster...

Using LOAD DATA INFILE

LOAD DATA INFILE is MySQL's high-speed solution to load data in bulk. It works by parsing a file on disk... That means that we have to customize a bit out Perl script to store the data in a file after parsing:


#!/usr/bin/perl -w

use strict;
use File::Temp qw/ tempfile /;

use DBI;

my $line;
my $dbh = DBI->connect("DBI:mysql:database=tmp", "test_perl");

# Create the temporary file
my ($fh, $filename) = tempfile(undef, UNLINK => 1);

while ($line = <>)
{
    next unless $line =~ /^([0-F]{6})\s+\(base 16\)\s+(.*)$/;
    my ($oui, $vendor) = ($1, $2);
    $fh->printf ("%s\t%s\n", $oui, $vendor);
}
$fh->close();

$dbh->do ("TRUNCATE TABLE oui");
$dbh->do ("LOAD DATA LOCAL INFILE '$filename' INTO TABLE oui(oui, vendor)");

And the timing gives...


mac@dax:~/tmp$ time ./test.pl oui.txt

real    0m0.493s
user    0m0.322s
sys     0m0.032s

Speedup of more than 10 times... And the DB's downtime is now limited to one single instruction. Difficult to do much better !