#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2012
    Posts
    31
    Rep Power
    3

    Parsing a Text File


    Hi,

    I have a file that looks like the sample below. Could someone kindly provide some code to parse it?

    ID Date Send Receive Note

    abcd-2e4d-7f2d-h3hsj8 2012-8-10 14:11:11 "peter.davis" <peter.davis@yahoo.com.in> "john.doe"
    <john.doe@gmail.com> HJ PLUS CHV: AMP_BW_Stra - 2 Users, 0 lines, 1month
    abcd-2e4d-7f2d-h3hsj8 2012-8-10 14:11:11 "jin.chin" <jin.chin@msn.com> "john.doe1"
    <john.doe1@gmail.com> HJ PLUS CHV: AMP_W_Sty - 2 Users, 0 lines, 1day


    I want to process the file starting from the 3rd line (that is, ignoring the line starting with ID and the empty lines) and produce a text file with the content below, using ; as the delimiter:

    abcd-2e4d-7f2d-h3hsj8;2012-8-10;davis, peter;peter.davis@yahoo.com;doe, john;john.doe@gmail.com;HJ PLUS CHV: AMP_BW_Stra;2;0

    abcd-2e4d-7f2d-h3hsj8;2012-8-10;chin, jin;jin.chin@msn.com;doe1, john;john.doe1@gmail.com;HJ PLUS CHV: AMP_W_Sty;2;0


    Basically the logic is :

    abcd-2e4d-7f2d-h3hsj8 -- This stays the same
    2012-8-10 14:11:11 --- This becomes 2012-8-10
    peter.davis --- This becomes davis, peter
    <peter.davis@yahoo.com.in> --- If the domain is @yahoo.com followed by anything else, then make sure that .com is the last part, so this becomes peter.davis@yahoo.com
    john.doe -- This becomes doe, john
    john.doe@gmail.com -- This stays as john.doe@gmail.com
    HJ PLUS CHV: AMP_BW_Stra - 2 Users, 0 lines, 1month -- This becomes HJ PLUS CHV: AMP_BW_Stra;2;0


    Many Thanks
  2. #2
    Any suggestions, please?
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Location
    Paris area, France
    Posts
    842
    Rep Power
    496
    Hello,

    Here is a quick, minimal attempt with regexes.

    Perl Code:
    use strict;
    use warnings;

    while (<DATA>){
         next if /^ID/ or /^\s*$/; # skips the ID header line and empty lines
         s/(\d{4}-\d\d?-\d\d? )\d\d:\d\d:\d\d /$1/g; # drops the time, keeps the date
         s/"(\w+)\.(\w+)"/$2, $1/g; # changes "john.doe" to doe, john
         s/\@yahoo\.com[.\w]+/\@yahoo.com/g; # trims anything after yahoo.com
         s/- (\d+) Users, (\d+) lines, \d+\s*\w+/;$1;$2/g; # replaces "- X Users, Y lines, Z days or months" with ;X;Y
         print; # prints $_ to the screen
    }
     
    __DATA__
     
    ID Date Send Receive Note
     
    abcd-2e4d-7f2d-h3hsj8 2012-8-10 14:11:11 "peter.davis" <peter.davis@yahoo.com.in> "john.doe"
    <john.doe@gmail.com> HJ PLUS CHV: AMP_BW_Stra - 2 Users, 0 lines, 1month
    abcd-2e4d-7f2d-h3hsj8 2012-8-10 14:11:11 "jin.chin" <jin.chin@msn.com> "john.doe1"
    <john.doe1@gmail.com> HJ PLUS CHV: AMP_W_Sty - 2 Users, 0 lines, 1day


    Used on the input data located at the end of the script after the __DATA__ tag, this outputs the following:

    Code:
    abcd-2e4d-7f2d-h3hsj8 2012-8-10 davis, peter <peter.davis@yahoo.com> doe, john
    <john.doe@gmail.com> HJ PLUS CHV: AMP_BW_Stra ;2;0
    abcd-2e4d-7f2d-h3hsj8 2012-8-10 chin, jin <jin.chin@msn.com> doe1, john
    <john.doe1@gmail.com> HJ PLUS CHV: AMP_W_Sty ;2;0
    So, unless I missed something, it applies the substitutions you listed to the data you have presented (the fields are not yet joined with ; and each record still spans two lines), but variations in your real data may require adjusting the regular expressions.
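
    To get from that output to the exact ;-delimited format requested in the first post, one possible approach is to join the two physical lines of each record and capture all fields with a single regex. The following is only a sketch under assumptions taken from the sample: every record has a quoted name followed by an address in angle brackets for both sender and receiver, continuation lines start with <, and the note ends with " - N Users, N lines, ...". The names format_record and join_records are made up for this example.

```perl
use strict;
use warnings;

# Turn one logical record (both physical lines joined by a space) into the
# ';'-delimited form. Returns an empty list if the record does not match.
sub format_record {
    my ($rec) = @_;
    my ($id, $date, $from_name, $from_addr, $to_name, $to_addr,
        $note, $users, $lines) =
        $rec =~ m{
            ^\s* (\S+)                       # ID
            \s+  (\d{4}-\d\d?-\d\d?)         # date
            \s+  \d\d?:\d\d?:\d\d?           # time, not captured
            \s+  "([^"]+)" \s+ <([^>]+)>     # sender name and address
            \s+  "([^"]+)" \s+ <([^>]+)>     # receiver name and address
            \s+  (.*?)                       # note text
            \s+ - \s+ (\d+) \s+ Users,       # number of users
            \s+  (\d+) \s+ lines,            # number of lines
        }x or return;

    # first.last becomes "last, first"
    for ($from_name, $to_name) { s/^(\w+)\.(\w+)$/$2, $1/ }
    # anything after yahoo.com is trimmed (the only fix that was requested)
    for ($from_addr, $to_addr) { s/\@yahoo\.com\.[.\w]+$/\@yahoo.com/ }

    return join ';', $id, $date, $from_name, $from_addr,
                     $to_name, $to_addr, $note, $users, $lines;
}

# Group physical lines into logical records. Assumption: a line starting
# with '<' continues the previous line.
sub join_records {
    my @records;
    for (@_) {
        my $line = $_;                    # copy, so the caller's data is untouched
        $line =~ s/^\s+|\s+$//g;          # trim; this also removes the newline
        next if $line eq '' or $line =~ /^ID\s/;   # skip blanks and the header
        if ($line =~ /^</ and @records) { $records[-1] .= " $line" }
        else                            { push @records, $line }
    }
    return @records;
}

my $sample = <<'END';
ID Date Send Receive Note

abcd-2e4d-7f2d-h3hsj8 2012-8-10 14:11:11 "peter.davis" <peter.davis@yahoo.com.in> "john.doe"
<john.doe@gmail.com> HJ PLUS CHV: AMP_BW_Stra - 2 Users, 0 lines, 1month
abcd-2e4d-7f2d-h3hsj8 2012-8-10 14:11:11 "jin.chin" <jin.chin@msn.com> "john.doe1"
<john.doe1@gmail.com> HJ PLUS CHV: AMP_W_Sty - 2 Users, 0 lines, 1day
END

print "$_\n" for map { format_record($_) } join_records(split /\n/, $sample);
```

    Records that do not match the pattern are silently dropped here; on real data it would probably be wiser to log them, since they are exactly the variations that need a closer look.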

    For example, since there was one digit for the month, I assumed that month and day could have 1 or 2 digits. But I assumed in the code above that hours, minutes and seconds would always have 2 digits; if this is not the case, the full regex can be changed to (untested):

    Perl Code:
    s/(\d{4}-\d\d?-\d\d? )\d\d?:\d\d?:\d\d? /$1/g; # changes date format
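
    Since that variant is marked as untested, here is a small way to try it out. The second timestamp is invented for the check and does not appear in the posted data, and strip_time is just a name chosen for this example.

```perl
use strict;
use warnings;

# Remove the time part, accepting 1 or 2 digits for hours, minutes and seconds.
sub strip_time {
    my ($s) = @_;
    $s =~ s/(\d{4}-\d\d?-\d\d? )\d\d?:\d\d?:\d\d? /$1/g;
    return $s;
}

# First input has the format seen in the thread; the second is hypothetical.
for my $in ('2012-8-10 14:11:11 "x.y"', '2012-12-1 9:05:3 "x.y"') {
    printf "%-26s => %s\n", $in, strip_time($in);
}
```

    Both inputs should come out with the date followed directly by the quoted name.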


    I could have changed it directly in the program above, but this is a good illustration of why you need to know more about the data to write something that works in all cases.

    In particular, I fixed the yahoo.com address, as specified in your message, but made absolutely no assumptions about other possible address errors: there are so many ways an e-mail address can be wrong, and so many rules on how to "validate" or fix one, that it is an almost impossible task.

    At the end of last year, I wrote such a program at my job to try to detect and fix wrong e-mail addresses. The more errors we fixed, the more new ones we found; we ended up with an 800-line program which could fix only about 15% of the wrong addresses in the database, with no guarantee that the new address was really correct: when you have john.doe@yahoo.con or john.doe@yahoo.cim, there is a reasonable chance that the correct one is john.doe@yahoo.com, but this is by no means certain.

    It turned out that about 90% of the corrected addresses worked (in the sense that, when used to send a mail, they did not trigger a return message from the provider saying that the address did not exist). But again, we could correct only 15% of the errors; most of the time, you know the address is wrong but just can't figure out how to correct it. For example, we found things like "xxx.xxx@xxx.xx"; there is no way you can figure out what that should be.

    I had warned my clients (a large European cell-phone operator) that this was an almost impossible task, but they insisted that we do as much as we could, and they were very happy with the result, as it saved them literally tens of thousands of phone calls to their customers to find out their correct addresses (the mail addresses were used for electronic invoicing, and it is quite important to warn your customers that you are going to direct-debit their bank account). After all, on a 35-million customer database with about 250,000 structurally wrong addresses, a 15% correction rate turned out to be quite a positive result.

    Well, I'm digressing quite a bit here, but I just wanted to warn you how tricky the idea of correcting e-mail addresses can be.
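
    To give an idea of what the conservative end of such a correction looks like, here is a tiny sketch. It is not the program described above: the typo table holds only the two yahoo examples from this post, the structural test is deliberately loose, and check_address and %DOMAIN_FIX are names invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical typo table: misspelled domain => probable intended domain.
my %DOMAIN_FIX = (
    'yahoo.con' => 'yahoo.com',
    'yahoo.cim' => 'yahoo.com',
);

# Returns (status, address), where status is 'ok', 'guess' or 'invalid'.
# A 'guess' is only a probable correction, never a certainty.
sub check_address {
    my ($addr) = @_;
    # Loose structural test: something@something.tld, no spaces, one '@'.
    my ($local, $domain) = $addr =~ /^([^\s\@]+)\@([^\s\@]+\.[A-Za-z]{2,})$/
        or return ('invalid', $addr);
    $domain = lc $domain;
    return ('guess', "$local\@$DOMAIN_FIX{$domain}")
        if exists $DOMAIN_FIX{$domain};
    return ('ok', $addr);
}

for my $addr (qw(john.doe@gmail.com john.doe@yahoo.con john.doe@yahoo xxx.xxx@xxx.xx)) {
    my ($status, $result) = check_address($addr);
    print "$addr: $status ($result)\n";
}
```

    Note that a placeholder like xxx.xxx@xxx.xx is reported as ok by this loose test, which shows how little a purely structural check can tell you about whether an address really exists.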
