Text cleaning with Excel VBA

1. The Problem

Recently I came across a project that my colleagues heavily rely themself on excel. Normally that is not a problem and I do think excel is a good tool as well.

However in the workflow we have a text column waiting to be cleaned and the logic is somehow complex. Actually we have about 10-15 if-else statements.

Algorithmically that is not a problem because it is only a FuzzBuzz question. Which means you have an array, and you can loop over the array, and check each one of the element whether it meets your condition.

If it does meet the condition, you output a label and store it in an new array. You can write as many if-else statements as you need.

But in excel that is a real problem. In the excel's cell you can hardly write meaningful funtion when the logic is complex. Besides, excel lacks some most commonly used functions, like string match.

Although I can use R or Python to handle the problem but my colleague's excel background requires an easy solution. Firstly I think maybe build a console applicaiton to do the cleaning job is a good idea, for example C# or Go. 

Somehow, I am thinking whether it is a overkill on this FuzzBuzz question and I do meet some GBK csv encoding problem in both C# and Go.

That is how I came back to VBA, if one method is stupid but works, it is not stupid after all.

 

2. Normal VBA

Press Alt+F11 we can inter Excel VBA editor. We will build our function here. Later we can link the function to a button into our spreadsheet.

I don't want to off topic so let's say we copy-paste the csv data into an excel Sheet1. (I mean it does have a method to automatically copy data from csv file into excel sheet, but we don't discuss in this article)

Althought in excel spreadsheet we lack of some commonly used built-in functions, but in VBA we have some good functions:

(1). Check data range
        rows: Sheet1.UsedRange.Rows.Count
  Warning: This result could easily exceed integer, so do not declare an integer to store it.
(2). Initialize a cell
  Cells(num, label) = ...
  eg: Cells(1, "G") = "Classified_Label"
(3). Data type change
  Cstr()
  eg: current = CStr(Cells(1, "D"))
  There are also other kind of C*() as a family, like CInt(), CDbl() and so on.
(4). String Detect
  InStr()
  eg: InStr(current, "Alex")
  This function is especially useful because in excel spreadsheet FIND() is too bad design to use.
(5). Substring
  Mid()
  eg: Mid(current, 2, 2)

Go back to the probelm. The For-loop with If-Else is also as normal as we will do other languages and it is much easier than write them in excel spreadsheet.

Let's say our original text column is "D" and we want to output result in column "G". This will make us easy to write example code.

Sub adjust()

    ' determine data range
    Dim length ' do not decalre it as an integer, because the range could bigger than intger.
    length = Sheet1.UsedRange.Rows.Count
    
    ' initialize column name
    Cells(1, "G") = "Classified_Label"
    
    ' adjust data ' do not declare it as an integer
    Dim i
    For i = 2 To length
        Dim current As String
        current = CStr(Cells(i, "D")) ' CStr() means toString()
        If InStr(current, "Alice") Then
            Cells(i, "G") = "Alex"
        ElseIf InStr(current, "Bob") Then
            Cells(i, "G") = "Bobe"
        ' ....  write as many as your logic need
        Else
            Cells(i, "G") = "Others"
        EndIf
    Next

End Sub

 

3. VBA with ADO  

Ado is a library of standars and functions developed by MicroSoft. Using this tool we can use SQL statement in VBA.

In real work, I occasionally meet someone's PC that cannot use this method. Maybe they lack of drivers or they are using bad version of excel.

But still it is good to know and be packed at our backpoctet.

Sub adjust()

    Dim CNN As Object
    Set CNN = CreateObject("ADODB.CONNECTION")
    
    Dim Driver As String
    Driver = "Driver={Microsoft Text Driver (*.txt; *.csv)};" &_
        "DBQ=" & ThisWorkbook.Path & "\;"
    CNN.Open "Provider=MSDASQL;" & Driver
    
    Dim file As String
    file = "data.csv"  ' can use InBox() function to require any data file on disk
    Dim Query As String
    Query = "SELECT * FROM " & file ' this step can be as complex as you need

    Dim RS As Object
    Set RS = CreateObject("ADODB.RECORDSET")
    RS = CNN.Excute(Query)

    With RS
        For i = 1 To .Fields.Count
            Cells(1, i).Value = .Fields(i - 1).Name
        Next
      
        Range("A2").CopyFromRecordset RS
            ActiveSheet.Cells.Font.Name = "宋体"
        .Close
    End With
    
    CNN.Close
    Set CNN = Nothing
    Set RS = Nothing

End Sub

The Query can be as complex as you need like what you will do in real SQL query, but in a more ACCESS SQL way. For example, we don't use CASE WHEN but IIF().

For an easy example:

Query = "SELECT *, IIF(InStr(Name, "Alice"), "Alex", " & _
    "IIF(InStr(Name, "Bob"), "Bobe", "Others"))" 

  

4. Finale

After our function is built, we can go back to Excel spreadsheet and add an geometry object into the spreadsheet. Right click the object and link  our VBA function to the button.

 

上一篇:谷歌最新提出无需卷积、注意力 ,纯MLP构成的视觉架构


下一篇:逆向将物体检测数据集生成labelme标注的数据