#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2014
    Posts
    4
    Rep Power
    0

    Get substring with a twist


    Hey guys.

    I am trying to figure out how to get the following substrings from email addresses.

    What I'm looking for is the substring that starts three characters before the @ and runs to the end of the line.

    Since I cannot put email addresses in a thread, suppose the * is the @ in the following examples.

    test*hotmail.com would be est*hotmail.com
    joe*hotmail.com would be *hotmail.com
    ed*hotmail.com would be *hotmail.com

    I read about a few functions, but I can't put them together to achieve this.

    split, index, or substr are the ones I was thinking about using.

    I appreciate the help.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,875
    Rep Power
    1225
    Show us what you've tried.

    Why would joe*hotmail.com end up being *hotmail.com instead of joe*hotmail.com?
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2014
    Posts
    4
    Rep Power
    0
    This is what I have so far but it doesn't work. It doesn't even run.

    perl -F"\|" -ane '$F[12]=~substring(@F,index('@',@F)-3)'; print join("|",@F)' A185219_110_EM_20140123_1.txt | head

    -su: syntax error near unexpected token `('

    Thanks.
  6. #4
  7. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2014
    Posts
    4
    Rep Power
    0
    This one runs but doesn't change the column.

    perl -F"\|" -ane '$F[14]=~substr($1,index(/'@/',$1)); print join("|",@F)' A185219_110_EM_20140123_1.txt | head

    Thanks!
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,875
    Rep Power
    1225
    Can you explain the details of what you think this statement is doing?
    $F[14]=~substr($1,index(/'@/',$1));

    Keep in mind that =~ is the regex binding operator, so where is the regex that it's bound to?

    It would be better if you did the development in a full script instead of a one-liner so that you can troubleshoot each step. You can always convert it to a one-liner later if needed.

    Please post at least 10 lines from your data file so that I can run a few tests. Alter the email addresses but keep them properly formatted as email addresses.

    Also please answer the question in my first post.
  10. #6
  11. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2014
    Posts
    4
    Rep Power
    0
    Hey.

    Somebody helped me out and built the following:

    perl -F"\|" -ane '($user,$dom)=split("@",$F[14]);$user=(length($user) > 3 ?"...".substr($user,-3):"..." );$F[39]="$user\@$dom";print join("|",@F)' A185219_110_EM_20140123_1.txt | head -5
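
    Spelled out as a script, the one-liner above does roughly the following. This is only a sketch: the sub name mask_email, the sample record and the column number are made up for illustration, and it overwrites the email column instead of writing to $F[39].

    ```perl
    use strict;
    use warnings;

    # Same rule as the one-liner: "..." plus the last three characters of
    # the user part, or just "..." when the user part has three or fewer.
    sub mask_email {
        my ($email) = @_;
        my ($user, $dom) = split /@/, $email, 2;
        $user = length($user) > 3 ? '...' . substr($user, -3) : '...';
        return "$user\@$dom";
    }

    # Hypothetical sample record; the real file has the address in $F[14].
    my $EMAIL_COL = 1;                     # assumption for this sample only
    my $line = "42|thomas\@example.com|NY\n";

    my @F = split /\|/, $line;             # what -F"\|" -a does for each line
    $F[$EMAIL_COL] = mask_email($F[$EMAIL_COL]);
    print join('|', @F);                   # last field still carries the newline
    ```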

    If you have a faster way to do this, let me know.

    Regards,
    Raf
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Intermediate (1500 - 1999 posts)

    Join Date
    Apr 2009
    Posts
    1,875
    Rep Power
    1225
    This is untested and would need to be benchmarked to see if it's faster.

    Code:
    perl -F"\|" -ane '$F[14] =~ s/^(\w+?([^@]{3}))@/$2@/ or $F[14] =~ s/^([^@]+)@/...@/; $F[39]="$1\@$dom"; $,="|"; print @F' A185219_110_EM_20140123_1.txt | head -5
  14. #8
  15. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    784
    Rep Power
    495
    @ sheds82:

    I will reiterate what FishMonger already told you: if you are a Perl beginner (and you are obviously a beginner), don't start with Perl one-liners. One-liners really need some Perl experience, mainly because when you have an error it is difficult to find where it is. They might also lead you to adopt bad programming habits. Write actual Perl scripts with one instruction per line: the compiler will gently tell you which line the error is on, and it will be easier to find and correct.

    My guess is that using the index function is likely to be the fastest solution (but being fast is useful only if you really have a lot of lines to process). Let me give a full script for this one:

    perl Code:
    use strict;
    use warnings;
     
    my $c = 'thomas@yahoo.com';
    my $at_position = index $c, '@';
    my $start = $at_position >= 3 ? $at_position - 3 : 0;
    print substr $c, $start; # prints "mas@yahoo.com"


    And, BTW, this also works if there are fewer than three characters before the @.
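
    One way to check the short-address case is to wrap the same index/substr logic in a sub and feed it a few addresses. This is a minimal sketch; the sub name last3_from_at and the sample addresses are made up.

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper wrapping the index/substr logic shown above.
    sub last3_from_at {
        my ($addr) = @_;
        my $at_position = index $addr, '@';
        # Fewer than three characters before the @ (or no @ at all):
        # start at the beginning of the string.
        my $start = $at_position >= 3 ? $at_position - 3 : 0;
        return substr $addr, $start;
    }

    # Prints mas@yahoo.com, joe@yahoo.com and ed@yahoo.com, one per line.
    print last3_from_at($_), "\n"
        for 'thomas@yahoo.com', 'joe@yahoo.com', 'ed@yahoo.com';
    ```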

    You could also use a regular expression:

    perl Code:
    use strict;
    use warnings;
     
    my $c = 'thomas@yahoo.com';
    print $1 if $c =~ /(.{1,3}\@.+)$/; # prints "mas@yahoo.com"


    This also works if there are fewer than three characters before the @.

    You could also use split:

    perl Code:
    use strict;
    use warnings;
     
    my $c = 'thomas@yahoo.com';
    my @d = split /@/, $c;
    print substr($d[0], -3), '@', $d[1]; # prints "mas@yahoo.com"


    I leave it to you to test whether this last one works when there are fewer than three characters before the @ (and to change it if it does not).
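
    If you would rather not depend on how substr treats a negative offset that reaches past the start of a short string, one option is to work out the number of characters to keep explicitly. A sketch, with a made-up sub name (last3_via_split):

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: split on the first @ and keep at most the
    # last three characters of the part in front of it.
    sub last3_via_split {
        my ($addr) = @_;
        my ($user, $domain) = split /@/, $addr, 2;
        my $keep = length($user) < 3 ? length($user) : 3;
        return substr($user, length($user) - $keep) . '@' . $domain;
    }

    print last3_via_split('thomas@yahoo.com'), "\n"; # prints "mas@yahoo.com"
    print last3_via_split('ed@yahoo.com'), "\n";     # prints "ed@yahoo.com"
    ```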

    If speed is really important, try the three scripts on a million lines or so and see which one is the fastest. (I should really be advising you to use the Benchmark module, but since you probably don't know what a module is, just measuring the run time should do for now.) BTW, since you seem to be working under Unix or Linux, the shell's time command will measure the run time for you. The syntax is as follows:

    Code:
    time perl my_perl_script.pl
    Once you've figured out which is the fastest, we can help you make a one-liner out of it if this is what you want.
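
    For later reference, a comparison with the Benchmark module (it ships with Perl) could look roughly like the sketch below. The sub names and the iteration count are arbitrary choices of mine, and the split variant uses an explicit length check; the timings will depend on your machine and your data, so measure rather than trust my guess.

    ```perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $c = 'thomas@yahoo.com';

    sub by_index {
        my $at = index $c, '@';
        return substr($c, $at >= 3 ? $at - 3 : 0);
    }

    sub by_regex {
        return $c =~ /(.{1,3}\@.+)$/ ? $1 : $c;
    }

    sub by_split {
        my ($user, $domain) = split /@/, $c, 2;
        my $keep = length($user) < 3 ? length($user) : 3;
        return substr($user, length($user) - $keep) . '@' . $domain;
    }

    # Run each sub 200_000 times and print a comparison table.
    cmpthese(200_000, {
        index => \&by_index,
        regex => \&by_regex,
        split => \&by_split,
    });
    ```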
    Last edited by Laurent_R; February 7th, 2014 at 05:27 PM.
