#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2012
    Posts
    34
    Rep Power
    6

    Fuzzy Regex Pattern Matching


    Hi, I am having some trouble finding an appropriate library for what I am trying to do in Java. I need to be able to perform approximate regex pattern matching in order to extract variables from a string. Let me provide an example:

    Suppose we have a model, say:
    "my name is <name> and I am <age> years old"

    This model would be converted to a standard Regex expression using groups to specify the variables, so something like:
    "my name is (\\p{Print}*) and I am (\\p{Print}*) years old"

    Then if we were given an input string:
    "name is John and I'm 25 years old"

    This does not match exactly but it is close, using normal Regex, there is no way to isolate "John" as <name> and "25" as <age> because the input doesn't technically match the expression. I have previously used TRE to solve this problem but it is specific to C and I want something native to Java. Are there any libraries out there that would be able to do this type of pattern matching? Note I have already looked at FREJ and it does not have the functionality I need.

    Alternatively, can anyone think of some way other than using Regex to solve this problem?

    Thanks.
    Last edited by ZGorlock; May 10th, 2016 at 07:38 PM.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2012
    Posts
    34
    Rep Power
    6
    I figured out a solution for this problem so I thought I would reply in case anyone else has the same problem in the future.

    I decided not to use regex but instead use a minimum string distance algorithm with 'dont care' characters allowed which serve as variables in this case.

    Here is the code:

    Code:
            final char varChar = '*';
            String pattern = "ab*fgh*l"; //pattern
            String text = "abcdefghijkl"; //text
    
            //remove double vars
            while (pattern.contains("**")) {
                pattern = pattern.replaceAll("\\*\\*", "*");
            }
    
            int m = pattern.length(); //i
            int n = text.length(); //j
    
            int g = 0; //# of dc chars
            int v = -1;
            for (int pi = 0; pi < m; pi++) {
                if (pattern.charAt(pi) == varChar) {
                    g++;
                    if (v == -1)
                        v = pi;
                }
            }
            if (g == 0)
                v = m + 1;
    
            int[][] D = new int[m + 1][n + 1];
            for (int j = 0; j <= n; j++)
                D[0][j] = 0;
            for (int i = 0; i <= v; i++)
                D[i][0] = i;
            for (int i = v + 1; i <= m; i++)
                D[i][0] = Integer.MAX_VALUE;
    
            for (int i = 1; i <= m; i++) {
                if (pattern.charAt(i - 1) == varChar) {
                    for (int j = 0; j <= n; j++)
                        D[i][j] = Integer.MAX_VALUE;
                    for (int j = 0; j <= n; j++) {
                        for (int h = j; h <= n; h++) { //assuming dont care set could span from 0 chars to the rest of the string
                            if (h <= n && D[i - 1][j] < D[i][h]) {
                                for (h = h; h <= n; h++)
                                    D[i][h] = D[i - 1][j];
                            }
                        }
                        if (D[i][n] == 0) //no need to continue this loop
                            break;
                    }
                } else {
                    for (int j = 1; j <= n; j++) {
                        int match = (pattern.charAt(i - 1) == text.charAt(j - 1)) ? 1 : 0;
                        D[i][j] = 1 + Math.min(Math.min(D[i - 1][j], D[i][j - 1]),
                                        D[i - 1][j - 1] - match);
                    }
                }
            }
    
            System.out.println("Pattern: " + pattern);
            System.out.println("Text:    " + text);
    
            System.out.println("The minimum edit distance is: " + D[m][n]);
    
            String[] vars = new String[g]; //stores the extracted variables
    
            int row = m; //pattern
            int col = n; //text
            int var = g - 1; //variable counter
            while (row > 1 || col > 1) { //start at bottom right corner, move to upper left corner
                if (pattern.charAt(row - 1) == varChar) { //the current row is a *
                    vars[var] = "";
                    while (col >= 1) { //move left to digest variable
                        if (D[row][col - 1] == D[row][col]) {
                            vars[var] = text.charAt(col - 1) + vars[var];
                            col--;
                        }
    
                        if (row > 1 && col >= 1) {
                            if (D[row][col - 1] != D[row][col]) { //stop at the next to last col in the sequence
                                int min = Math.min(D[row - 1][col], D[row - 1][col - 1]); //first move out of * row
                                if (D[row - 1][col] == min) {
                                    row--; //move up
                                }
                                else { //D[row - 1][col - 1] == min
                                    row--; //move diagonally up and left
                                    col--;
                                }
    
                                break; //out of variable
                            }
                        }
                    }
                    var--;
                }
                else if (pattern.charAt(row - 2) == varChar) { //if the next row is a *
                    col--; //move diagonally up and left
                    row--;
                }
                else {
                    if (row > 1) {
                        if (col >= 1) {
                            int min = Math.min(Math.min(D[row - 1][col], D[row - 1][col - 1]), //move
                                    D[row][col - 1]);
                            if (D[row - 1][col] == min) {
                                row--; //move up
                            } else if (D[row - 1][col - 1] == min) {
                                row--; //move diagonally up and left
                                col--;
                            } else { //D[row][col - 1] == min
                                col--; //move left
                            }
                        }
                        else {
                            row--; //move up
                        }
                    }
                    else {
                        col--; //move left
                    }
                }
            }
    
            System.out.println("Extracted Variables:");
            for (int i = 0; i < g; i++) {
                System.out.println("    " + vars[i]);
            }

    Comments on this post

    • Will-O-The-Wisp agrees : Thanks!
  4. #3
  5. Not An Expert
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jan 2015
    Posts
    404
    Rep Power
    3
    Thank you very much for sharing your solution!
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2012
    Posts
    34
    Rep Power
    6
    I have updated this algorithm to return all optimal variable extraction paths instead of just the first one it detects. Below is an example usage:

    Code:
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Objects;
    
    public class FuzzyEditDistance
    {
        
        /**
         * The character used to specify a variable in a string compare.
         */
        public static final String varChar = "";
        
        /**
         * The current index in the vars and tokens list to store extracted vars and tokens for extractVariables().
         *
         * @see #extractVariables(String, String, int[][], int, int, List, List)
         */
        private static int extractionDepth = 0;
        
        
        /**
         * Main method.
         *
         * @param args Arguments to main method.
         */
        public static void main(String[] args)
        {
            String source = "What the hell are lobsters";
            String target = " s";
            
            List<List<String>> vars = new ArrayList<>();
            List<List<String>> tokens = new ArrayList<>();
            
            double score = stringCompare(target, source, vars, tokens, false);
            System.out.println("Score: " + score);
            System.out.println();
            
            System.out.println("---");
            for (int i = 0; i < vars.size(); i++) {
                System.out.println("Extraction #" + i);
                
                System.out.println("  Variables:");
                for (String var : vars.get(i)) {
                    System.out.println("    '" + var + "'");
                }
                System.out.println("  Tokens:");
                for (String token : tokens.get(i)) {
                    System.out.println("    '" + token + "'");
                }
                
                System.out.println("---");
            }
        }
        
        
        /**
         * Compares the closeness of two strings and extracts variables and non-variables is applicable.
         *
         * @param target     The string to compare against.
         * @param source     The string to compare.
         * @param vars       The list to store extracted variables in. This will contain a list of all optimal variable extraction sets.
         * @param tokens     The list to store extracted non-variable tokens in. This will contain a list of all optimal non-variable token extraction sets.
         * @param ignoreCase Whether to ignore case of not.
         * @return The closeness of the two strings, between 0 and 1.
         */
        public static double stringCompare(String target, String source, List<List<String>> vars, List<List<String>> tokens, boolean ignoreCase)
        {
            return (target.isEmpty() || source.isEmpty()) ? 0 : ((double) (target.length() - stringEditDistance(target, source, vars, tokens, ignoreCase)) / target.length());
        }
        
        /**
         * Calculates the distance between two strings with variable collection.
         *
         * @param pattern    The pattern to compare the string against.
         * @param text       The string to compare.
         * @param vars       The list to store extracted variables in. This will contain a list of all optimal variable extraction sets.
         * @param tokens     The list to store extracted non-variable tokens in. This will contain a list of all optimal non-variable token extraction sets.
         * @param ignoreCase Whether to ignore case or not.
         * @return The edit distance between the two strings.
         */
        public static int stringEditDistance(String pattern, String text, List<List<String>> vars, List<List<String>> tokens, boolean ignoreCase)
        {
            String doubleVarChar = varChar + varChar;
            
            while (pattern.contains(doubleVarChar)) {
                pattern = pattern.replaceAll(doubleVarChar, varChar); //remove double vars
            }
            
            int m = pattern.length(); //i
            int n = text.length(); //j
            
            int g = 0; //number of variables
            int v = -1; //index of first variable
            for (int pi = 0; pi < m; pi++) {
                if (Objects.equals(varChar, pattern.substring(pi, pi + 1))) {
                    g++;
                    if (v == -1) {
                        v = pi;
                    }
                }
            }
            if (g == 0) {
                v = m;
            }
            
            int[][] D = new int[m + 1][n + 1];
            
            for (int j = 0; j <= n; j++) {
                D[0][j] = 0;
            }
            for (int i = 0; i <= v; i++) {
                D[i][0] = i;
            }
            for (int i = v + 1; i <= m; i++) {
                D[i][0] = Integer.MAX_VALUE;
            }
            
            for (int i = 1; i <= m; i++) {
                if (Objects.equals(varChar, pattern.substring(i - 1, i))) {
                    for (int j = 0; j <= n; j++) {
                        D[i][j] = Integer.MAX_VALUE;
                    }
                    for (int j = 0; j <= n; j++) {
                        for (int h = j; h <= n; h++) { //assuming don't care set could span from 0 chars to the rest of the string
                            if ((h <= n) && (D[i - 1][j] < D[i][h])) {
                                for (h = h; h <= n; h++) {
                                    D[i][h] = D[i - 1][j];
                                }
                            }
                        }
                        if (D[i][n] == 0) //no need to continue this loop
                        {
                            break;
                        }
                        
                        printDmatch(pattern, text, D, i, j);
                    }
                } else {
                    for (int j = 1; j <= n; j++) {
                        int matchChar;
                        if (ignoreCase) {
                            matchChar = (Character.toUpperCase(pattern.charAt(i - 1)) == Character.toUpperCase(text.charAt(j - 1))) ? 1 : 0;
                        } else {
                            matchChar = (pattern.charAt(i - 1) == text.charAt(j - 1)) ? 1 : 0;
                        }
                        D[i][j] = 1 + Math.min(Math.min(D[i - 1][j], D[i][j - 1]),
                                D[i - 1][j - 1] - matchChar);
                        
                        printDmatch(pattern, text, D, i, j);
                    }
                }
            }
            
            if (tokens == null) {
                tokens = new ArrayList<>();
            }
        
            printDmatch(pattern, text, D);
            
            if ((vars != null) && (g > 0)) //if there are variables
            {
                extractVariables(pattern, text, D, pattern.length(), text.length(), vars, tokens);
            }
            
            return D[m][n];
        }
        
        /**
         * Extracts variables from the array produced by stringEditDistance().
         *
         * @param pattern The pattern to compare the string against.
         * @param text    The string to compare.
         * @param D       The array produced by stringEditDistance().
         * @param r       The row to start extraction from.
         * @param c       The column to start extraction from.
         * @param vars    The list to store extracted variables in. This will contain a list of all optimal variable extraction sets.
         * @param tokens  The list to store extracted non-variable tokens in. This will contain a list of all optimal non-variable token extraction sets.
         */
        private static void extractVariables(String pattern, String text, int[][] D, int r, int c, List<List<String>> vars, List<List<String>> tokens)
        {
            int row = r;
            int col = c;
            
            if ((r == pattern.length()) && (c == text.length())) {
                extractionDepth = 0;
            }
            int i = extractionDepth;
            
            if (vars.size() < (i + 1)) {
                vars.add(new ArrayList<>());
            }
            if (tokens.size() < (i + 1)) {
                tokens.add(new ArrayList<>());
            }
            
            StringBuilder thisVar = new StringBuilder();
            StringBuilder thisToken = new StringBuilder();
            while ((row >= 0) && (col > 0)) { //start at bottom right corner, move to upper left corner
                if ((row > 0) && Objects.equals(varChar, pattern.substring(row - 1, row))) { //the current row is a var char
                    
                    while (col >= 1) { //move left to digest variable
                        if (D[row][col - 1] == D[row][col]) {
                            thisVar.insert(0, text.charAt(col - 1));
                            col--;
                        }
                        
                        if ((row > 1) && (col >= 1)) {
                            if (D[row][col - 1] != D[row][col]) { //stop at the next to last col in the sequence
                                int min = Math.min(D[row - 1][col], D[row - 1][col - 1]); //first move out of * row
                                if (D[row - 1][col] == min) {
                                    row--; //move up
                                } else { //D[row - 1][col - 1] == min
                                    row--; //move diagonally up and left
                                    col--;
                                }
                                break; //out of variable
                                
                            } else if (D[row][col - 1] == D[row - 1][col]) { //detect alternate optimal variable path
                                List<String> popVars = new ArrayList<>(vars.get(i));
                                List<String> popTokens = new ArrayList<>(tokens.get(i));
                                popVars.add(thisVar.toString());
                                vars.add(popVars);
                                tokens.add(popTokens);
                                extractionDepth++;
                                extractVariables(pattern, text, D, row - 1, col, vars, tokens);
                            }
                        }
                        
                        printDmatch(pattern, text, D, row, col);
                    }
                    
                    vars.get(i).add(0, thisVar.toString().trim());
                    thisVar = new StringBuilder();
                    
                } else if ((row > 1) && Objects.equals(varChar, pattern.substring(row - 2, row - 1))) { //if the next row is a var char
                    thisToken.insert(0, text.charAt(col - 1));
                    tokens.get(i).add(0, thisToken.toString());
                    thisToken = new StringBuilder();
                    
                    col--; //move diagonally up and left
                    row--;
                    
                    printDmatch(pattern, text, D, row, col);
                    
                } else {
                    if (row > 0) {
                        int min = Math.min(Math.min(D[row - 1][col], D[row - 1][col - 1]), //move
                                D[row][col - 1]);
                        if (D[row - 1][col] == min) {
                            row--; //move up
                        } else if (col > 0) {
                            if (D[row - 1][col - 1] == min) {
                                thisToken.insert(0, text.charAt(col - 1));
                                row--; //move diagonally up and left
                                col--;
                            } else { //D[row][col - 1] == min
                                thisToken.insert(0, text.charAt(col - 1));
                                col--; //move left
                            }
                        }
                    } else {
                        thisToken.insert(0, text.charAt(col - 1));
                        col--; //move left
                    }
                    
                    if (col == 0) {
                        tokens.get(i).add(0, thisToken.toString());
                    }
        
                    printDmatch(pattern, text, D, row, col);
                }
            }
            
            printDmatch(pattern, text, D);
        }
        
        
        /**
         * Prints out the D matrix used in the stringEditDistance method.
         *
         * @param pattern The pattern to compare the string against.
         * @param text    The string to compare.
         * @param D       The matrix to print.
         * @see #printDmatch(String, String, int[][], int, int)
         */
        private static void printDmatch(String pattern, String text, int[][] D)
        {
            printDmatch(pattern, text, D, -1, -1);
        }
        
        /**
         * Prints out the D matrix used in the stringEditDistance method with an indicator at a location.
         *
         * @param pattern The pattern to compare the string against.
         * @param text    The string to compare.
         * @param D       The matrix to print.
         * @param row     The current row location of the algorithm.
         * @param col     The current column location of the algorithm.
         */
        private static void printDmatch(String pattern, String text, int[][] D, int row, int col)
        {
            int m = pattern.length(); //i
            int n = text.length(); //j
            
            for (int i = 0; i <= text.length(); i++) {
                if (i > 0) {
                    if (i == 1) {
                        System.out.print("   ");
                    }
                    System.out.print(text.charAt(i - 1) + "  ");
                } else {
                    System.out.print("   ");
                }
            }
            System.out.println();
            
            for (int i = 0; i <= m; i++) {
                if (i > 0) {
                    System.out.print(pattern.charAt(i - 1) + " ");
                } else {
                    System.out.print("  ");
                }
                
                for (int j = 0; j <= n; j++) {
                    
                    if ((i == row) && (j == col)) {
                        System.out.print("*");
                    } else {
                        System.out.print(" ");
                    }
                    
                    if (D[i][j] == Integer.MAX_VALUE) {
                        System.out.print("^ ");
                    } else {
                        System.out.print(D[i][j] + " ");
                    }
                    
                }
                System.out.println();
            }
            
            System.out.println();
        }
        
    }

IMN logo majestic logo threadwatch logo seochat tools logo