    Count the number of words (from a word list) in a string

    Hi guys,

    I have written this small piece of code to count how often each US state name occurs in a string (the contents of a text file of up to 3 MB):
        @stringa = ("alabama","alaska","arizona", "arkansas","california","colorado","connecticut", "delaware", "florida", "georgia", "hawaii", "idaho", "illinois", "indiana", "iowa", "kansas", "kentucky", "louisiana", "maine", "maryland", "massachusetts", "michigan", "minnesota", "mississippi", "missouri", "montana", "nebraska", "nevada", "new hampshire", "new jersey", "new mexico", "new york", "north carolina", "north dakota", "ohio", "oklahoma", "oregon", "pennsylvania", "rhode island", "south carolina", "south dakota",  "tennessee", "texas", "utah", "vermont", "virginia", "washington", "west virginia", "wisconsin", "wyoming");    
    @num=(0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);   
    my $j = 0;
    foreach $line1 (@stringa) {
        my $k = 0;
        $k++ while ($basic_text =~ m/\b$line1\b/g);
        $num[$j] = $k;
        $j++;
    }
    I am quite new to regular expressions, and I guess this code could be improved, especially with regard to speed. Any suggestions?

    Thanks folks
  2. #2
    Is it right to assume that $basic_text contains the content of your whole 3 MB file? If so, your way of parsing for states is probably the best way to do it, despite the need to loop 50 times over the string. You may want to add an "i" modifier to your regex to ignore case.

    However, in terms of your use of Perl data structures, I would store the states and number of occurrences in a hash (the state being the key and the number of occurrences the value), rather than two arrays. Something like this:

    Perl Code:
    my @list = ("alabama","alaska","arizona", "arkansas", ...);
    my %states_hash = map { $_ => 0 } @list;
    foreach my $state (keys %states_hash) {
         $states_hash{$state}++ while ($basic_text =~ m/\b$state\b/g);
    }
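
    A self-contained version of the hash-based loop might look like the following. It is only a sketch: the shortened @list and the sample $basic_text are made up for illustration, and the /i modifier and the \Q...\E quoting are additions on top of the snippet above.

```perl
use strict;
use warnings;

# Shortened list and sample text, both invented for this illustration.
my @list       = ("alabama", "alaska", "new york", "texas");
my $basic_text = "Texas is bigger than Alaska. New York, new york! texas";

# One counter per state, all starting at zero.
my %states_hash = map { $_ => 0 } @list;

foreach my $state (keys %states_hash) {
    # /g steps through every match, /i ignores case,
    # \Q...\E quotes any regex metacharacters in the name.
    $states_hash{$state}++ while $basic_text =~ m/\b\Q$state\E\b/gi;
}

# Print the counts in alphabetical order of the state names.
printf "%-10s %d\n", $_, $states_hash{$_} for sort keys %states_hash;
```

    The text is still scanned once per state, so this only tidies up the bookkeeping; it should not be expected to be faster than the two-array version.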

    The alternative is to split your input into individual words, to check each word, and increase the hash value for that word if the hash entry exists (i.e. if the word is a US state). This second solution will probably be faster, but, depending on what your input looks like, you have to make sure that you actually get the words properly stripped of any punctuation or other unwanted characters.

    Possibly something along these lines would do the trick:

    Perl Code:
    my @list = ("alabama","alaska","arizona", "arkansas", ...);
    my %states_hash = map { $_ => 0 } @list;
    $states_hash{$_}++ for grep {exists $states_hash{$_}} split /\b/, $basic_text;

    Thinking about it, this will probably be much faster because you go only once through your input.
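
    One caveat with splitting into single words: multi-word names such as "new york" never appear as one word, so their entries would stay at zero. A possible alternative, which is a different technique from the two suggested above, is to join all names into one alternation and walk through the text once. The sketch below assumes a shortened, made-up @list and sample text. Note that it counts non-overlapping matches, so "West Virginia" is counted for "west virginia" only and not also for "virginia".

```perl
use strict;
use warnings;

# Shortened list and sample text, both invented for this illustration.
my @list       = ("virginia", "west virginia", "new york", "texas");
my $basic_text = "West Virginia borders Virginia. New York is not in Texas; texas is big.";

# One counter per state, all starting at zero.
my %states_hash = map { $_ => 0 } @list;

# Build a single case-insensitive pattern: \b(name1|name2|...)\b
my $alternation = join "|", map { quotemeta($_) } @list;
my $regex       = qr/\b($alternation)\b/i;

# One pass over the text; lower-case the match to get the hash key.
while ($basic_text =~ /$regex/g) {
    $states_hash{ lc $1 }++;
}

printf "%-14s %d\n", $_, $states_hash{$_} for sort keys %states_hash;
```

    Whether this is actually faster than 50 separate scans depends on the Perl version and the input, so it is worth timing both on a real file.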
