1. The Problem
Recently I came across a project that my colleagues heavily rely themself on excel. Normally that is not a problem and I do think excel is a good tool as well.
However in the workflow we have a text column waiting to be cleaned and the logic is somehow complex. Actually we have about 10-15 if-else statements.
Algorithmically that is not a problem because it is only a FuzzBuzz question. Which means you have an array, and you can loop over the array, and check each one of the element whether it meets your condition.
If it does meet the condition, you output a label and store it in an new array. You can write as many if-else statements as you need.
But in excel that is a real problem. In the excel's cell you can hardly write meaningful funtion when the logic is complex. Besides, excel lacks some most commonly used functions, like string match.
Although I can use R or Python to handle the problem but my colleague's excel background requires an easy solution. Firstly I think maybe build a console applicaiton to do the cleaning job is a good idea, for example C# or Go.
Somehow, I am thinking whether it is a overkill on this FuzzBuzz question and I do meet some GBK csv encoding problem in both C# and Go.
That is how I came back to VBA, if one method is stupid but works, it is not stupid after all.
2. Normal VBA
Press Alt+F11 we can inter Excel VBA editor. We will build our function here. Later we can link the function to a button into our spreadsheet.
I don't want to off topic so let's say we copy-paste the csv data into an excel Sheet1. (I mean it does have a method to automatically copy data from csv file into excel sheet, but we don't discuss in this article)
Althought in excel spreadsheet we lack of some commonly used built-in functions, but in VBA we have some good functions:
(1). Check data range
rows: Sheet1.UsedRange.Rows.Count
Warning: This result could easily exceed integer, so do not declare an integer to store it.
(2). Initialize a cell
Cells(num, label) = ...
eg: Cells(1, "G") = "Classified_Label"
(3). Data type change
Cstr()
eg: current = CStr(Cells(1, "D"))
There are also other kind of C*() as a family, like CInt(), CDbl() and so on.
(4). String Detect
InStr()
eg: InStr(current, "Alex")
This function is especially useful because in excel spreadsheet FIND() is too bad design to use.
(5). Substring
Mid()
eg: Mid(current, 2, 2)
Go back to the probelm. The For-loop with If-Else is also as normal as we will do other languages and it is much easier than write them in excel spreadsheet.
Let's say our original text column is "D" and we want to output result in column "G". This will make us easy to write example code.
Sub adjust() ' determine data range Dim length ' do not decalre it as an integer, because the range could bigger than intger. length = Sheet1.UsedRange.Rows.Count ' initialize column name Cells(1, "G") = "Classified_Label" ' adjust data ' do not declare it as an integer Dim i For i = 2 To length Dim current As String current = CStr(Cells(i, "D")) ' CStr() means toString() If InStr(current, "Alice") Then Cells(i, "G") = "Alex" ElseIf InStr(current, "Bob") Then Cells(i, "G") = "Bobe" ' .... write as many as your logic need Else Cells(i, "G") = "Others" EndIf Next End Sub
3. VBA with ADO
Ado is a library of standars and functions developed by MicroSoft. Using this tool we can use SQL statement in VBA.
In real work, I occasionally meet someone's PC that cannot use this method. Maybe they lack of drivers or they are using bad version of excel.
But still it is good to know and be packed at our backpoctet.
Sub adjust() Dim CNN As Object Set CNN = CreateObject("ADODB.CONNECTION") Dim Driver As String Driver = "Driver={Microsoft Text Driver (*.txt; *.csv)};" &_ "DBQ=" & ThisWorkbook.Path & "\;" CNN.Open "Provider=MSDASQL;" & Driver Dim file As String file = "data.csv" ' can use InBox() function to require any data file on disk Dim Query As String Query = "SELECT * FROM " & file ' this step can be as complex as you need Dim RS As Object Set RS = CreateObject("ADODB.RECORDSET") RS = CNN.Excute(Query) With RS For i = 1 To .Fields.Count Cells(1, i).Value = .Fields(i - 1).Name Next Range("A2").CopyFromRecordset RS ActiveSheet.Cells.Font.Name = "宋体" .Close End With CNN.Close Set CNN = Nothing Set RS = Nothing End Sub
The Query can be as complex as you need like what you will do in real SQL query, but in a more ACCESS SQL way. For example, we don't use CASE WHEN but IIF().
For an easy example:
Query = "SELECT *, IIF(InStr(Name, "Alice"), "Alex", " & _ "IIF(InStr(Name, "Bob"), "Bobe", "Others"))"
4. Finale
After our function is built, we can go back to Excel spreadsheet and add an geometry object into the spreadsheet. Right click the object and link our VBA function to the button.