#1
  1. No Profile Picture
    Contributing User
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2003
    Posts
    3,505
    Rep Power
    594

    Regex to validate date


    I want to validate a date in the form mm/dd/yyyy and am not having much luck. This is what I thought should work (in perl):
    Code:
    $startDate!~/[0-1][0-9]\/[0-3][0-9]\/[1-2][0-9]\{3\}/;
    TIA.
    There are 10 kinds of people in the world. Those that understand binary and those that don't.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    836
    Rep Power
    496
    Originally Posted by gw1500se
    I want to validate a date in the form mm/dd/yyyy and am not having much luck. This is what I thought should work (in perl):
    Code:
    $startDate!~/[0-1][0-9]\/[0-3][0-9]\/[1-2][0-9]\{3\}/;
    TIA.
    Hi,

    try this (my significant corrections in red):

    Code:
    $startDate =~ /[0-1][0-9]\/[0-3][0-9]\/[1-2][0-9]{3}/;
    At least, it works against today: "03/20/2013".

    It could be made somewhat simpler:

    Code:
    $startDate =~ /[0-1]\d\/[0-3]\d\/[1-2]\d{3}/;
    or even, at least in Perl (avoiding back shashes to escape slashes):

    Code:
    $startDate =~ m|[0-1]\d/[0-3]\d/[1-2]\d{3}|;
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2003
    Posts
    3,505
    Rep Power
    594
    Thanks for the reply. The !~ is correct as a not match in the context of my 'if' block. I was getting a syntax error on the {} which is why I escaped them. That eliminated the error and the script worked to the extent of the incorrect match string. I forgot about the \d syntax and was not aware of the m| | form. That is good to know and may solve my syntax problem or at least generate an error that points to where it really is.
    There are 10 kinds of people in the world. Those that understand binary and those that don't.
  6. #4
  7. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    Hi,

    don't use regexes for range checks. This will either not work (like all of your regexes), or it will be horribly complex and cumbersome.

    Regexes are made for syntactic checks. You can use them to check if the date format is correct. But don't try to actually check if it's a valid date. Use a date function or some library for that (I'm sure Perl has thousands of them).

    Your regexes will reject valid dates like 1/1/2013 while accepting nonsense like 13/39/2999, so that obviously doesn't work at all. In order to fix it, you'd have to make the regex 10 times as long, and you still wouldn't account for special cases like leap years.
    The 6 worst sins of security ē How to (properly) access a MySQL database with PHP

    Why canít I use certain words like "drop" as part of my Security Question answers?
    There are certain words used by hackers to try to gain access to systems and manipulate data; therefore, the following words are restricted: "select," "delete," "update," "insert," "drop" and "null".
  8. #5
  9. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    836
    Rep Power
    496
    I understand what you mean and agree in principle, but I do not fully agree.

    For sure, if you really want to make sure that a date is valid, then you want to use a special library, or simply a date function that will generate an error if the date does not exists. A regex is not good enough. Period.

    But under the same constraints, the only way to validate a phone number might be to call it to see if someone answers. Same thing for an e-mail address: send an email to the addressee and see if she or he answers.

    Sometimes (quite often, in fact), you only need to know if some part of a line might be a phone number, an e-mail address, a date, a social security number or an IP address, or if it is plain English text. In such cases, you generally don't need a full-proof validation of whether the address or number is valid, but just need to distinguish one from another. Using regexes for that is in my humble view a perfectly legitimate way of doing that, provided that you know that the fact that you have not rejected the data does not mean that the data is fully valid, it only means that it has not been rejected by the plausibility criteria that you have chosen.

    Another example is my work, which consists in trying to guarantee the consistency between the various applications or databases of a huge information system. Thousands of applications and databases, tens of millions of clients, many many terabytes of data. We usually compare data extracted from two databases, to try to detect data records that are in one database and not in the other one or the other way (which we call orphans), records having the same key but not the same detailed information, etc.

    Before we can run the comparison itself, we need to clean the data from both databases to remove things that we know to be inconsistent just by looking at the data itself. For example, if there are some duplicate records in one of the database (data records having the same key but not necessarily the same additional information), we need to remove these duplicates before the comparison, because it does not make sense to compare data records that are known to be wrong, they will need to be processed in a different manner. Similarly, in our specific setting, we would check that the mobile telephone number is a plausible mobile telephone number, that the customer number has 10 digits, that the IMSI (International Mobile Subscriber Identifier) has 15 digits, that the e-mail address more or less looks like an e-mail address, etc. It is only once all this data cleaning has been performed that we can run the comparison itself. This makes the analysis much easier, because the automated correction action will generally not be the same for data that we can know is plain wrong just by looking at it and for records that look OK but have differences between one database and the other.

    The work that I am describing has been developed by a department of about 50 persons of many different backgrounds over a period of 14 years, quite a bit of cumulated business logic, functional and technical experience. We are saving each month for our company millions of euros of revenues that would be otherwise lost, and saving even much more money (although it is more difficult to give a precise figure) in avoided customer dissatisfaction and avoided customer relationship costs. Many of our processes are scrutinized by chartered accountants and Sarbanes-Oxley compliance auditors. Well, in brief, we are really not amateurs in this business of data quality and data consistency. But, for us, checking whether a piece of data might be a date or not with a simple regex is relatively common practice because, although this is evidently insufficient, this is only one part of a fairly complicated process or progressive elimination of errors. Having said that, I must add that regex validation is only an extremely tiny part of the overall validation process, but at certain important steps, it does get 90% or 95% of the bad data out of the way. It does make a huge difference when human analysis is the alternative.

    Our databases have literally tens or hundreds of billions of dates (maybe more). Most of them are hopefully correct. But if, say, one out of one million is wrong, that makes possible hundreds of thousands of wrong dates. If a simple regex enables me to identify 95% of them, I have gone a very very long way in making my information system more accurate and more reliable. Even though I agree that it is only part of the story and more work has to be done.
  10. #6
  11. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1014
    Dates aren't like phone numbers or email addresses. You can tell right away if a date is valid or not. If you don't like external libraries, you can write your own function in a few lines. This is trivial.

    And of course there's the possibility of merely checking the format of a date. A regex is perfect for that:
    Code:
    /\d{1,2}\/\d{1,2}\/\d{4}/
    But you're doing neither. You "misuse" a tool made for syntactic checks to do rough semantic checks. This might be necessary or useful in your specific case, but it's hardly a sensible solution for standard applications.

    So I'm not saying that dates should always be completely validated. It might be enough to only check the format. But if you do want to check the exact day, month and year (as gw1500se clearly does), a regex is the wrong tool.

    I mean, let's say I wanna restrict an input field to a number between 1250 and 2000. I can do that by actually checking the number in my application. Or I can merely check if it's a three digits number. But you wouldn't come up with a regex like
    Code:
    /[12][0-2][0-5]\d/
    Because that's neither simple (to justify the imperfection) nor correct (to justify the complexity). It's simply the wrong tool.

    I'm no Perl guy, so I don't know the best library for that. But the DateTime module(?) looks like it should do the trick.
    The 6 worst sins of security ē How to (properly) access a MySQL database with PHP

    Why canít I use certain words like "drop" as part of my Security Question answers?
    There are certain words used by hackers to try to gain access to systems and manipulate data; therefore, the following words are restricted: "select," "delete," "update," "insert," "drop" and "null".
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Novice (500 - 999 posts)

    Join Date
    Jun 2012
    Posts
    836
    Rep Power
    496
    Originally Posted by Jacques1
    But you're doing neither. You "misuse" a tool made for syntactic checks to do rough semantic checks.
    I am not doing that. I just corrected syntax errors in the OP's regex. Nothing more.

    Frankly, in the data consistency case I was describing, I simply do not care whether the date is valid, the only thing I need to know is whether the date is correct. Whether it says March 36, or March 22, if this is incorrect, because the real date is Feb. 11, I could not care less in the end, it makes no difference to me, it is wrong and needs to be corrected, there is nothing to add.

    But, as I said, we have very good reasons to have a two-step procedure (saying two-step is a simplification, but that does not matter): preprocessing of the input data and then only comparison of the data between the two data sources.

    Two additional points.
    - In the context that I described, dates are coming from a database, so they are generally valid in the original data, because the database software would have rejected an invalid date.
    - But, sometimes, the extraction software made a mistake. Say, for example, used a wrong date format (DD/MM/YYYY instead of MM/DD/YYYY, or the other way around, or something else, possibly YYYY/MM/DD). In an ideal world, these things should not happen, but in such ideal world, my data consistency department should be useless. And the fact is that the world is not ideal. For various reasons, it is much easier to analyze the data issues when we first remove obvious inconsistencies in the input data and run our comparison software on cleaner data, rather than comparing the raw files.

    Most often, I only check what you called a date format.

    As you said, something like:

    Code:
    /\d{1,2}\/\d{1,2}\/\d{4}/
    will do the job. But it will not detect the difference between European and US format. The last day of last year could be:
    31/12/2012 or
    12/31/2012.

    Your regex does not make a difference between common European and US format.

    The regex provided by the OP, as I corrected it:

    Code:
    $startDate =~ /[0-1][0-9]\/[0-3][0-9]\/[1-2][0-9]{3}/;
    would detect this error: 31/12/2012 will fail the test, while 12/31/2012 will not. That is a huge difference for our processes, because I get the error early, even before I run the comparison.

    In brief, I agree with your case in theory, but reality is often more complex.

    I said that I am working as a consultant for a department with 14 years of cumulated experience (by a number of people, in total something in the tune of 750 man-year of experience) in this area. Personally, I have been with this department for only slightly less than three years. But my own previous experience also involved heavy data quality work and includes several major software migrations and three currency changes (euro migration) for companies which were in each case among the top five or top ten companies in their respective countries. Just to name a few: Deutsche Telekom, British Telecom, Orange (France and two other smaller countries), Verizon, Bouygues Telecom, and also some smaller projects in Spain, Norway, Greece, the Czek Republic, Pakistan, Turkey, Lebanon, and a few others. And my previous personal experience just matches that of the department I am working for now: before you run an actual data comparison, you need to clean the data, so that really messy things are eliminated early and don't pollute the actual comparison. Simple regexes on date are just possibly a part of that policy.

IMN logo majestic logo threadwatch logo seochat tools logo