Sunday, February 10, 2013

Excel as an ETL Tool

This is probably going to seem extremely basic to some people, but I think it’s a very useful technique that many SQL developers might not know about.

You can use Excel as a quick and precise ETL tool.  Not for 10GB data files.  Not for data that’s not already in Excel.  But when you have data in Excel – especially if it’s messy – and you want to get in into SQL Server quickly, there’s a good way of doing that.

Let’s start with some messy Excel data.  Here are the current NBA standings that I copied from the NBA’s website.

image

Now, the data isn’t hideous.  It’s organized in a consistent manner.  It’s just not normalized.  The technique I’m going to describe works on data in all kinds of formats.

Create the Table

The first thing I’m going to do is define the table in SQL.  I’m going to write the create table statement in Excel, so that my entire script will be in one place.  So in cell M2, I’m going to write:

create table dbo.NbaStandings (Id integer identity,TimePeriod varchar(100), Conference varchar(50), Division varchar(50), Team varchar(50), Wins integer, Losses integer, WinPercent decimal(4, 3), GamesBack decimal(10, 2), ConfRecord varchar(10), DivRecord varchar(10), HomeRecord varchar(10), RoadRecord varchar(10), Last10 varchar(5), Streak varchar(5));

The column definitions are derived from the data, of course.  That’s the easy part.

Insert the Data, Part 1

Next I’m going to create the insert statements in Excel, referencing the data in the spreadsheet.  In cell M6, I’m going to write:

="insert into dbo.NbaStandings (TimePeriod, Conference, Division, Team, Wins, Losses, WinPercent, GamesBack, ConfRecord, DivRecord, HomeRecord, RoadRecord, Last10, Streak) values ("

But I'm not ready to write more just yet.  SQL Server string values have to be enclosed in single quotes.  Also, when you concatenate cell references in an Excel formula, you have to use double quotes.  I could write the rest of the insert statement that way, but that’s going to be one cumbersome, unreadable formula.  Some Excel macros will make it much easier to write.

Write the Macros (or just use mine)

Here are the macros I’m going to create in the Excel VBA editor.  I recommend keeping these macros in your personal macro workbook, so you don’t have to add them to every workbook you want to use them in.

This macro will escape single quotes in in a string, and enclose a string in single quotes:

Public Function SqlEncode(TheString) As String

 Public Function SqlEncode(TheString) As String  

Dim ReturnValue As String
ReturnValue = TheString

ReturnValue = Replace(ReturnValue, Chr(39), Chr(39) & Chr(39)) 'escape apostrophe
ReturnValue = Chr(39) & ReturnValue & Chr(39) 'enclose in apostrophes

SqlEncode = ReturnValue

End Function

This macro will take a single cell, and apply the macro above to its value, optionally adding a comma:

 Public Function SqlValue(TheRange As Range, Optional IncludeComma As Boolean = True) As String  

Dim ReturnValue As String

ReturnValue = ReturnValue & SqlEncode(TheRange.Text)

If IncludeComma Then ReturnValue = ReturnValue & ", "

SqlValue = ReturnValue

End Function

And this macro will take a range of cells, apply the SqlEncode macro to all of them, and concatenate them with commas:

 Public Function SqlValues(TheRange As Range) As String  

Dim TheCell As Range
Dim ReturnValue As String
Dim counter As Integer

counter = 0

For Each TheCell In TheRange.Cells
If counter > 0 Then ReturnValue = ReturnValue & ", "
ReturnValue = ReturnValue & SqlEncode(TheCell.Text)
counter = counter + 1
Next

SqlValues = ReturnValue

End Function

With those macros created, I’m ready for the next step.


Insert the Data, Part 2


The first real record on the spreadsheet is on line 6, but the values in the three rows above will also be part of the record.  This is where the data not being normalized makes things messy.  Here’s my continued formula:

="insert into dbo.NbaStandings (TimePeriod, Conference, Division, Team, Wins, Losses, WinPercent, GamesBack, ConfRecord, DivRecord, HomeRecord, RoadRecord, Last10, Streak) values (" & SqlValue(B$3) & SqlValue(B$4) & SqlValue(B$5)

Notice that I made the row reference for those three values absolute, because I don’t want them to change when I copy the formula to new rows.


The next part of the formula uses the data on the same row my formula is on, so I can use the SqlValues macro to take care of them all:

="insert into dbo.NbaStandings (TimePeriod, Conference, Division, Team, Wins, Losses, WinPercent, GamesBack, ConfRecord, DivRecord, HomeRecord, RoadRecord, Last10, Streak) values (" & SqlValue(B$3) & SqlValue(B$4) & SqlValue(B$5) & SqlValues(B6:L6) & ");"

My formula is now complete. Here's the resulting TSQL:

insert into dbo.NbaStandings (TimePeriod, Conference, Division, Team, Wins, Losses, WinPercent, GamesBack, ConfRecord, DivRecord, HomeRecord, RoadRecord, Last10, Streak) values ('2012-2013 Division Regular Season Standings', 'Eastern Conference', 'Atlantic', 'New York', '32', '16', '0.667', '0', '20-9', '41398', '19-6', '13-10', '41458', 'W 1');

Now I can go ahead and copy the formula into all of the other data rows on the spreadsheet:


image


I have to change the formula’s references to the conference and division cells several times when copying the formula, because this data set isn’t normalized, as previously pointed out.  I could have used an OFFSET formula in Excel to get around having to do that, but I wanted to keep this example relatively simple.


Execute the Script


The entire script to create and populate the table in now in column M of my spreadsheet.  I’m going to go ahead and copy that column into new query window in SQL Server:


image


Once I execute the script, the data is now in a SQL Server table:


image


Conclusion


Given how de-normalized that data was, this wouldn’t have been possible with a standard ETL engine such as SSIS, without first normalizing the data in Excel.


At my job, I get a lot of data in a format like this or worse, so I’ve found this technique very helpful over the years.  I hope others will fine it helpful as well.

No comments:

Post a Comment