This is probably going to seem extremely basic to some people, but I think it’s a very useful technique that many SQL developers might not know about.
You can use Excel as a quick and precise ETL tool. Not for 10GB data files. Not for data that’s not already in Excel. But when you have data in Excel – especially if it’s messy – and you want to get in into SQL Server quickly, there’s a good way of doing that.
Let’s start with some messy Excel data. Here are the current NBA standings that I copied from the NBA’s website.
Now, the data isn’t hideous. It’s organized in a consistent manner. It’s just not normalized. The technique I’m going to describe works on data in all kinds of formats.
Create the Table
The first thing I’m going to do is define the table in SQL. I’m going to write the create table statement in Excel, so that my entire script will be in one place. So in cell M2, I’m going to write:
create table dbo.NbaStandings (Id integer identity,TimePeriod varchar(100), Conference varchar(50), Division varchar(50), Team varchar(50), Wins integer, Losses integer, WinPercent decimal(4, 3), GamesBack decimal(10, 2), ConfRecord varchar(10), DivRecord varchar(10), HomeRecord varchar(10), RoadRecord varchar(10), Last10 varchar(5), Streak varchar(5));
The column definitions are derived from the data, of course. That’s the easy part.
Insert the Data, Part 1
Next I’m going to create the insert statements in Excel, referencing the data in the spreadsheet. In cell M6, I’m going to write:
="insert into dbo.NbaStandings (TimePeriod, Conference, Division, Team, Wins, Losses, WinPercent, GamesBack, ConfRecord, DivRecord, HomeRecord, RoadRecord, Last10, Streak) values ("
But I'm not ready to write more just yet. SQL Server string values have to be enclosed in single quotes. Also, when you concatenate cell references in an Excel formula, you have to use double quotes. I could write the rest of the insert statement that way, but that’s going to be one cumbersome, unreadable formula. Some Excel macros will make it much easier to write.
Write the Macros (or just use mine)
Here are the macros I’m going to create in the Excel VBA editor. I recommend keeping these macros in your personal macro workbook, so you don’t have to add them to every workbook you want to use them in.
This macro will escape single quotes in in a string, and enclose a string in single quotes:
Public Function SqlEncode(TheString) As String
Public Function SqlEncode(TheString) As String
Dim ReturnValue As String
ReturnValue = TheString
ReturnValue = Replace(ReturnValue, Chr(39), Chr(39) & Chr(39)) 'escape apostrophe
ReturnValue = Chr(39) & ReturnValue & Chr(39) 'enclose in apostrophes
SqlEncode = ReturnValue
End Function
This macro will take a single cell, and apply the macro above to its value, optionally adding a comma:
Public Function SqlValue(TheRange As Range, Optional IncludeComma As Boolean = True) As String
Dim ReturnValue As String
ReturnValue = ReturnValue & SqlEncode(TheRange.Text)
If IncludeComma Then ReturnValue = ReturnValue & ", "
SqlValue = ReturnValue
End Function
And this macro will take a range of cells, apply the SqlEncode macro to all of them, and concatenate them with commas:
Public Function SqlValues(TheRange As Range) As String
Dim TheCell As Range
Dim ReturnValue As String
Dim counter As Integer
counter = 0
For Each TheCell In TheRange.Cells
If counter > 0 Then ReturnValue = ReturnValue & ", "
ReturnValue = ReturnValue & SqlEncode(TheCell.Text)
counter = counter + 1
Next
SqlValues = ReturnValue
End Function
With those macros created, I’m ready for the next step.
Insert the Data, Part 2
The first real record on the spreadsheet is on line 6, but the values in the three rows above will also be part of the record. This is where the data not being normalized makes things messy. Here’s my continued formula:
="insert into dbo.NbaStandings (TimePeriod, Conference, Division, Team, Wins, Losses, WinPercent, GamesBack, ConfRecord, DivRecord, HomeRecord, RoadRecord, Last10, Streak) values (" & SqlValue(B$3) & SqlValue(B$4) & SqlValue(B$5)
Notice that I made the row reference for those three values absolute, because I don’t want them to change when I copy the formula to new rows.
The next part of the formula uses the data on the same row my formula is on, so I can use the SqlValues macro to take care of them all:
="insert into dbo.NbaStandings (TimePeriod, Conference, Division, Team, Wins, Losses, WinPercent, GamesBack, ConfRecord, DivRecord, HomeRecord, RoadRecord, Last10, Streak) values (" & SqlValue(B$3) & SqlValue(B$4) & SqlValue(B$5) & SqlValues(B6:L6) & ");"
My formula is now complete. Here's the resulting TSQL:
insert into dbo.NbaStandings (TimePeriod, Conference, Division, Team, Wins, Losses, WinPercent, GamesBack, ConfRecord, DivRecord, HomeRecord, RoadRecord, Last10, Streak) values ('2012-2013 Division Regular Season Standings', 'Eastern Conference', 'Atlantic', 'New York', '32', '16', '0.667', '0', '20-9', '41398', '19-6', '13-10', '41458', 'W 1');
Now I can go ahead and copy the formula into all of the other data rows on the spreadsheet:
I have to change the formula’s references to the conference and division cells several times when copying the formula, because this data set isn’t normalized, as previously pointed out. I could have used an OFFSET formula in Excel to get around having to do that, but I wanted to keep this example relatively simple.
Execute the Script
The entire script to create and populate the table in now in column M of my spreadsheet. I’m going to go ahead and copy that column into new query window in SQL Server:
Once I execute the script, the data is now in a SQL Server table:
Conclusion
Given how de-normalized that data was, this wouldn’t have been possible with a standard ETL engine such as SSIS, without first normalizing the data in Excel.
At my job, I get a lot of data in a format like this or worse, so I’ve found this technique very helpful over the years. I hope others will fine it helpful as well.
No comments:
Post a Comment