#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2014
    Posts
    4
    Rep Power
    0

    Help with complex (for me!) IsMatch/Replace ...


    Hi all,
    I have to imporve performance of a piece of code which has been coded originally without regex (see commented part of the code snippet). I managed to change it to use Regex.IsMatch but it is still not much faster. I wonder I could change the query to do a replace of characters in one step.
    I have list of input values of variable lenght and variable content e.g.
    215XZ-2-7

    EA2325

    215T098

    218VZAH031-Y

    215XZ-1-7

    EA0436

    GO9901

    I have to replace each letter with an X, each number with a 9 and leave anything else as is.

    The code so far looks like this:
    Code:
    internal String DetermineFormat(String objectCode)
    
            {
    
                StringBuilder format = new StringBuilder();
    
                for (int i = 0; i < objectCode.Length; i++)
    
                {
    
             
    
                    if (System.Text.RegularExpressions.Regex.IsMatch(objectCode.Substring(i, 1).ToUpper(), @"^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]+$"))
    
                    {
    
                        format.Append("X");
    
                    }
    
                    else if (System.Text.RegularExpressions.Regex.IsMatch(objectCode.Substring(i, 1).ToUpper(), @"^[0123456789]+$"))
    
                    {
    
                        format.Append("9");
    
                    }                
    
                    else
    
                    {   //append the character as is
    
                        format.Append(objectCode.Substring(i, 1));
    
                    }
    
                    //Below from the MS Access app:
    
                    //if ("ABCDEFGHIJKLMNOPQRSTUVWXYZ".IndexOf(objectCode.Substring(i, 1).ToUpper()) > -1)
    
                    //{
    
                    //    format.Append("X");
    
                    //}
    
                    //else if ("0123456789".IndexOf(objectCode.Substring(i, 1).ToUpper()) > -1)
    
                    //{
    
                    //    format.Append("9");
    
                   //}              
    
                    //else
    
                    //{   //append the character as is
    
                    //    format.Append(objectCode.Substring(i, 1));
    
                    //}
    
                }
    
                return format.ToString(); 
    
            }


    I would really like instead of looping through each objectCode to transform each on in one step, to hopefully imporve performance. We are reading through large files to do this, so I am hoping to bring time down a lot.
    I am a total regex newbie though and I am totally lost as what to do.
    Any help much appreciated.
    Thanks very much in advance!
    Andrea
  2. #2
  3. Contributing User
    Devshed Demi-God (4500 - 4999 posts)

    Join Date
    Aug 2011
    Posts
    4,701
    Rep Power
    480
    The algorithm to construct the regular expression finite state machine takes a long time. You've built 1 or 2 new FSMs per character. (The FSMs must be internally cached and are not rebuilt otherwise your comment would have been "my program crawls" instead of "is not much faster".) The python RE module has an option to compile the RE into an object, which one stores to a variable. Later you can call that object's match function. Hence your version of the program is slow because the RE code needs to poke around its RE cache for the FSM corresponding to "^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]+$" or to the digits.

    The other version was slow because the most likely algorithm for indexOf is a linear search. For these non-alphanumeric characters the algorithm multiplies its work by 36.

    If in whatever language can directly compare characters and are working in ASCII I'd expect the utterly straightforward test where AND is probably && to run substantially faster.
    Code:
    if (('A' <= objectCode.Substring(i, 1).ToUpper()) AND (objectCode.Substring(i, 1).ToUpper() <= 'Z')) {
      //append 'X' to format
    } else if (('0'<=objectCode.Substring(i, 1)) AND (objectCode.Substring(i, 1) <= '9')) {
      //append '9' to format
    }
    If string comparisons aren't allowed you should instead be able to use the ordinal values ord(objectCode.Substring(i, 1)) and compare integers.

    • conversion to uppercase should be a 1 machine mask instruction (binary and). It's okay to call that twice, probably faster than saving to a temp variable.
    • You don't need uppercase to test digits (unless perhaps there's some funky unicode trickery taking place).
    • The strings you're matching have length 1. The ^+$ characters of your RE just cause the RE routines to construct overly complex FSMs
    • You repeatedly call the Substring method. This probably has no performance impact. But it might. See list item 1 for similar solution.


    The fastest algorithm I know of for this problem prepares a list of replacement values each character, and then in the loop just index into that list using the ASCII value of the current character. Thereby removing the if statements altogether from the data in the files. Build a look-up table in known time over the 256 characters, and then directly translate your files of unknown size.
    [code]Code tags[/code] are essential for python code and Makefiles!
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2014
    Posts
    4
    Rep Power
    0

    Thanks


    Hi David,
    thanks very much for your comprehensive reply. I tried all the suggestions you made, the ordinal and also the lloading a table ahead with all the values and indexing into it. But all the different variations hardly differ at all in speed. My test file with a thousand codes takes pretty much one minute for every variation, differing by a second or two.
    My current code looks like this, pre-loading a lookup tableand then indexing into it. Is this what you meant ?

    Definition of the lookup table:
    Code:
     private readonly static Dictionary<char, char> LookupTable = new Dictionary<char, char>();
    Loading the lookup at the start:
    Code:
    private void LoadLookup()
            {
                string numbers = "0123456789";
                string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    
                foreach (char c in numbers)
                {
                    LookupTable.Add(c, '9');
                }
                foreach (char c in chars)
                {
                    LookupTable.Add(c, 'X');
                }
            }
    Then creating the new format string:
    Code:
    internal String DetermineFormat(String objectCode)
            {
                
                var format = new StringBuilder();
                char newChar;
    
                for (int i = 0; i < objectCode.Length; i++)
                {
                    if (LookupTable.TryGetValue(Convert.ToChar(objectCode.Substring(i, 1)), out newChar))
                    {
                        format.Append(newChar);
                    }
                    else
                    {
                        format.Append(objectCode.Substring(i, 1));
                    }
                    
                    ///////////////////////////////// option 1
                   //if ((String.CompareOrdinal("A", objectCode.Substring(i, 1).ToUpper()) <= 0) && (String.CompareOrdinal(objectCode.Substring(i, 1).ToUpper(), "Z") <= 0)) {
                   //   //append 'X' to format
                   //    format.Append("X");
                   //}
                   //else if ((String.CompareOrdinal("0", objectCode.Substring(i, 1).ToUpper()) <= 0) && (String.CompareOrdinal(objectCode.Substring(i, 1).ToUpper(), "9") <= 0))
                   //{
                   //   //append '9' to format
                   //    format.Append("9");
                   //}
                   //else
                   //{   //append the character as is
                   //    format.Append(objectCode.Substring(i, 1));
                   //}
                    ///////////////////////////////////////////////////////////////////////////////////////// option 2
                    //if (System.Text.RegularExpressions.Regex.IsMatch(objectCode.Substring(i, 1).ToUpper(), @"^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]+$"))
                    //{
                    //    format.Append("X");
                    //}
                    //else if (System.Text.RegularExpressions.Regex.IsMatch(objectCode.Substring(i, 1).ToUpper(), @"^[0123456789]+$"))
                    //{
                    //    format.Append("9");
                    //}
                    //else  if (System.Text.RegularExpressions.Regex.IsMatch(objectCode.Substring(i, 1).ToUpper(), @"^[~]+$"))
                    //{
                    //    //Do nothing, ignore character, just for reference
                    //}
                    //else
                    //{   //append the character as is
                    //    format.Append(objectCode.Substring(i, 1)); 
                    //}
    
                    //Below from the MS Access app:////////////////////////////////////////////////////////////////////option 3
                    //if ("ABCDEFGHIJKLMNOPQRSTUVWXYZ".IndexOf(objectCode.Substring(i, 1).ToUpper()) > -1)
                    //{
                    //    format.Append("X");
                    //}
                    //else if ("0123456789".IndexOf(objectCode.Substring(i, 1).ToUpper()) > -1)
                    //{
                    //    format.Append("9");
                    //}
                    //else if ("~".IndexOf(objectCode.Substring(i, 1).ToUpper()) > -1)
                    //{
                    //    //Do nothing, ignore character, just for reference
                    //}
                    //else
                    //{   //append the character as is
                    //    format.Append(objectCode.Substring(i, 1)); 
                    //}
                }
                return format.ToString();
    
            }
    Is this what you were thinking ?
    Is there anything else I can do ?

    Thanks so much for your help
    Andrea



    Originally Posted by b49P23TIvg
    The algorithm to construct the regular expression finite state machine takes a long time. You've built 1 or 2 new FSMs per character. (The FSMs must be internally cached and are not rebuilt otherwise your comment would have been "my program crawls" instead of "is not much faster".) The python RE module has an option to compile the RE into an object, which one stores to a variable. Later you can call that object's match function. Hence your version of the program is slow because the RE code needs to poke around its RE cache for the FSM corresponding to "^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]+$" or to the digits.

    The other version was slow because the most likely algorithm for indexOf is a linear search. For these non-alphanumeric characters the algorithm multiplies its work by 36.

    If in whatever language can directly compare characters and are working in ASCII I'd expect the utterly straightforward test where AND is probably && to run substantially faster.
    Code:
    if (('A' <= objectCode.Substring(i, 1).ToUpper()) AND (objectCode.Substring(i, 1).ToUpper() <= 'Z')) {
      //append 'X' to format
    } else if (('0'<=objectCode.Substring(i, 1)) AND (objectCode.Substring(i, 1) <= '9')) {
      //append '9' to format
    }
    If string comparisons aren't allowed you should instead be able to use the ordinal values ord(objectCode.Substring(i, 1)) and compare integers.

    • conversion to uppercase should be a 1 machine mask instruction (binary and). It's okay to call that twice, probably faster than saving to a temp variable.
    • You don't need uppercase to test digits (unless perhaps there's some funky unicode trickery taking place).
    • The strings you're matching have length 1. The ^+$ characters of your RE just cause the RE routines to construct overly complex FSMs
    • You repeatedly call the Substring method. This probably has no performance impact. But it might. See list item 1 for similar solution.


    The fastest algorithm I know of for this problem prepares a list of replacement values each character, and then in the loop just index into that list using the ASCII value of the current character. Thereby removing the if statements altogether from the data in the files. Build a look-up table in known time over the 256 characters, and then directly translate your files of unknown size.
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2014
    Posts
    4
    Rep Power
    0
    Hi David,
    I have replied to this long time ago, but my post has not been posted ???
    Andrea

    Originally Posted by b49P23TIvg
    The algorithm to construct the regular expression finite state machine takes a long time. You've built 1 or 2 new FSMs per character. (The FSMs must be internally cached and are not rebuilt otherwise your comment would have been "my program crawls" instead of "is not much faster".) The python RE module has an option to compile the RE into an object, which one stores to a variable. Later you can call that object's match function. Hence your version of the program is slow because the RE code needs to poke around its RE cache for the FSM corresponding to "^[ABCDEFGHIJKLMNOPQRSTUVWXYZ]+$" or to the digits.

    The other version was slow because the most likely algorithm for indexOf is a linear search. For these non-alphanumeric characters the algorithm multiplies its work by 36.

    If in whatever language can directly compare characters and are working in ASCII I'd expect the utterly straightforward test where AND is probably && to run substantially faster.
    Code:
    if (('A' <= objectCode.Substring(i, 1).ToUpper()) AND (objectCode.Substring(i, 1).ToUpper() <= 'Z')) {
      //append 'X' to format
    } else if (('0'<=objectCode.Substring(i, 1)) AND (objectCode.Substring(i, 1) <= '9')) {
      //append '9' to format
    }
    If string comparisons aren't allowed you should instead be able to use the ordinal values ord(objectCode.Substring(i, 1)) and compare integers.

    • conversion to uppercase should be a 1 machine mask instruction (binary and). It's okay to call that twice, probably faster than saving to a temp variable.
    • You don't need uppercase to test digits (unless perhaps there's some funky unicode trickery taking place).
    • The strings you're matching have length 1. The ^+$ characters of your RE just cause the RE routines to construct overly complex FSMs
    • You repeatedly call the Substring method. This probably has no performance impact. But it might. See list item 1 for similar solution.


    The fastest algorithm I know of for this problem prepares a list of replacement values each character, and then in the loop just index into that list using the ASCII value of the current character. Thereby removing the if statements altogether from the data in the files. Build a look-up table in known time over the 256 characters, and then directly translate your files of unknown size.
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2014
    Posts
    4
    Rep Power
    0
    Anyway,
    first of all thank you very much for your comprehensive answer!
    I tried all the suggestions you made, but performance changed only negligent. Thanks to your reply I however thougth about it much more, and came up with a one-line solution, which was just as fast (or slow) as the other solutions (ordinals, list of ascci values to index into etc):

    Code:
    String format = Regex.Replace(Regex.Replace(candidateTagData.ObjectCode, @"[0123456789]", "9"), @"[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]", "X");
    It takes lliek the other solutions about 1 minute per 1000 records. I looked closer into the code and made a 2/3 performance improvment down to about 20 seconds by removing two superfluous dbcontext.SaveChanges in the loop (we're using Entity Framework).
    This is teh fastest I can get it now, if you have other suggestions I am happy to try them out ;-)

    But so far, thanks again, I learned a lot from you!
    Regards, Andrea

IMN logo majestic logo threadwatch logo seochat tools logo