#1
  1. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2003
    Location
    Cleveland OH USA
    Posts
    44
    Rep Power
    16

    Determine that a string is non-phonetic?


    Spammers attacking my system, and others I've noticed as well, reliably fill certain input fields with character strings that are obviously non-phonetic and usually 8 characters long. We're talking strings like

    bdpqntkqij
    Dlwngygn
    Jmuzgtbf
    Eijkhbxn
    Jtwvbfdu
    Hvdllmer
    Gkjebtsi
    Hbchoaxy
    Xnxryxao
    Cqcywjim
    Pjactfis
    Ymsbbsbs
    Hvvyxcrl
    Umdjomor
    Wpqfiumu
    Gpeaomkm

    --that sort of thing. Is there any code that can detect non-phonetic strings so that they can be flagged as spam? I am aware of the soundex function but I'm not sure how it might be applied.
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jul 2003
    Posts
    4,473
    Rep Power
    653
    I doubt there is any foolproof methodology. I suggest you process it through a spell checker and if it is not found then reject the data. You may reject a few unnecessarily but if you keep an eye on it for a while and add the words you might come pretty close. Poor spellers will have problems if you use this methodology.
    Last edited by gw1500se; February 16th, 2014 at 11:22 AM.
    There are 10 kinds of people in the world. Those that understand binary and those that don't.
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2003
    Location
    Cleveland OH USA
    Posts
    44
    Rep Power
    16
    I don't think spellchecking will work because this is on a name field, and most proper nouns aren't in spellcheck dictionaries. Most people's names, however, are still fairly phonetic: they have predictable patterns of consonant-vowel succession so as to form phones. The examples I posted would utterly fail the phonal requirement--you simply can't pronounce them because they're gibberish of mostly consonants that don't have a proper arrangement with interstitial vowels, and they have a high frequency of seldom-used consonants to boot.

    There must be a way to do this. I can't imagine this isn't a solved problem--looking at a string and asking, "Is this pronounceable or is it just somebody mashing down on the keyboard?"
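To put rough numbers on the consonant-vowel pattern described above, here is a minimal sketch. The function name `nameStats` is hypothetical, it only looks at the letters a-z, and it treats y as a consonant. No cutoff values are given in this thread, so none are assumed here.

```php
<?php
// Hypothetical helper: crude "pronounceability" statistics for a string.
// Assumptions: Latin alphabet only, y counted as a consonant.
function nameStats(string $name): array {
    // Keep only the letters a-z, lowercased.
    $letters = preg_replace('/[^a-z]/', '', strtolower($name));
    $length = strlen($letters);
    if ($length === 0) {
        return ['length' => 0, 'vowelRatio' => 0.0, 'longestConsonantRun' => 0];
    }
    // Count vowels.
    $vowels = preg_match_all('/[aeiou]/', $letters);
    // Find the longest unbroken run of consonants.
    $longest = 0;
    if (preg_match_all('/[^aeiou]+/', $letters, $runs)) {
        foreach ($runs[0] as $run) {
            $longest = max($longest, strlen($run));
        }
    }
    return [
        'length' => $length,
        'vowelRatio' => $vowels / $length,
        'longestConsonantRun' => $longest,
    ];
}
```

A string such as 'Hvvyxcrl' has no vowels and one long consonant run, while 'Steve' has a vowel ratio of 0.4 and a longest run of 2. Where to draw the line between the two is a judgment call.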
  6. #4
  7. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1018
    Hi,

    no offense, but this is nonsense.

    Who are you to decide that a person has a “wrong name”? There's a world outside of the USA. Different cultures have different names, many of which you would probably classify as “gibberish” (lack of vowels etc.). Does that mean your site may only be used by people who happen to have a name which is common in your culture?

    I understand that this may look like an obvious solution to you, but it simply does not work. It discriminates against legitimate users based on cultural ignorance, and it's a gigantic effort for a poor result.

    If you want spam protection, there are many established solutions which actually work:

    • Hidden fields are a primitive, but surprisingly effective method against average spam bots.
    • CAPTCHAs like Google's reCAPTCHA are another common approach. However, they're not barrier-free, which makes them problematic.
    • There are very powerful trainable content classifiers like CRM114 which recognize spammy texts with an extremely high success rate.
    • ...
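The hidden-field idea from the list above can be sketched roughly as follows. The field name `website` and the function name `looksLikeBot` are assumptions for illustration. The form would contain an extra input that is hidden from humans with CSS, so a non-empty value suggests an automated submission.

```php
<?php
// Hypothetical honeypot check. "website" is an assumed field name for an
// input that human visitors never see and therefore leave empty.
function looksLikeBot(array $post, string $honeypotField = 'website'): bool {
    return isset($post[$honeypotField])
        && trim((string) $post[$honeypotField]) !== '';
}
```

In a form handler this would typically be called as `looksLikeBot($_POST)` before any other processing.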

    Comments on this post

    • Nilpo agrees
    Last edited by Jacques1; February 16th, 2014 at 02:00 PM.
    The 6 worst sins of security • How to (properly) access a MySQL database with PHP

    Why can’t I use certain words like "drop" as part of my Security Question answers?
    There are certain words used by hackers to try to gain access to systems and manipulate data; therefore, the following words are restricted: "select," "delete," "update," "insert," "drop" and "null".
  8. #5
  9. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Feb 2014
    Location
    Sano, Japan
    Posts
    3
    Rep Power
    0
    Originally Posted by Jacques1
    Hi,
    If you want spam protection, there are many established solutions which actually work:
    I think anything you add yourself is likely to be effective, because it's a one-off. If your site is all about dogs, add a select input to the form that asks for a dog breed and offers (e.g.):

    Viagra
    XXyreewn
    Collie
    Catfood
    Genuine

    And accept only "Collie". Or whatever...

    Comments on this post

    • Jacques1 disagrees : Please read the question. This is about people's names.
  10. #6
  11. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2003
    Location
    Cleveland OH USA
    Posts
    44
    Rep Power
    16
    Jacques1, I understand that a few cultures have names that are light on the vowel side when transliterated into the Latin alphabet. Even some very common Western names have relatively few vowels: my own surname, one of the most common German surnames, has 7 letters and only one vowel.

    That said, your apparent indignation is unfounded. Naturally, I would hope that the solution I'm seeking would not be unduly exclusionary. And it would not need to be. As many strange names as there are out there, they do not share the letter-frequency and arrangement characteristics of the spam strings posted above. I don't think you can come up with examples of real names that resemble those spam strings. Even if you could, the probability that a person with such a name (8+ characters, nearly all consonants, some of them rare consonants, and arranged non-phonetically) would use my system is infinitesimally small. And as we know, any spam system is bound to have some false positives that require ad hoc fixes via a whitelist.
  12. #7
  13. No Profile Picture
    Contributing User
    Devshed Specialist (4000 - 4499 posts)

    Join Date
    Jul 2003
    Posts
    4,473
    Rep Power
    653
    What about using a surname database?
  14. #8
  15. No Profile Picture
    I haz teh codez!
    Devshed Frequenter (2500 - 2999 posts)

    Join Date
    Dec 2003
    Posts
    2,574
    Rep Power
    2343
    Don't know what you're running, but how about using a service like Stop Forum Spam?
    I ♥ ManiacDan & requinix

    This is a sig, and not necessarily a comment on the OP:
    Please don't be a help vampire!
  16. #9
  17. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1018
    @Robert:

    What I'm saying is that the approach is simply wrong. The whole idea of rejecting a human because they have a “wrong name” is just silly and pretty offensive.

    If you're going to stick to your “solution” no matter what, well, go ahead. But if you want honest advice: Don't do it. I already listed several actual solutions to the spam problem, and there are many others. You're working on the wrong end.
  18. #10
  19. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,884
    Rep Power
    6356
    We've discussed name validation before. If your problem is spam bots, establish an anti-spam-bot tool like reCaptcha. Do not invalidate people's names. If your problem is human spammers filling out your form and correctly using the captcha, establish a strong mod policy and allow banning by IP and IP block. Do not invalidate people's names. There are solutions to your problem. Do not randomly choose a task and declare that it's the solution to your problem.
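For the "banning by IP and IP block" part, a minimal sketch of a block check might look like this. The function name `ipInBlock` is hypothetical, the sketch handles IPv4 only, and it assumes a 64-bit PHP build.

```php
<?php
// Hypothetical helper: true if an IPv4 address falls inside a CIDR block
// such as "203.0.113.0/24". A bare address is treated as a /32.
// Assumes 64-bit PHP; IPv6 is not handled.
function ipInBlock(string $ip, string $cidr): bool {
    [$network, $bits] = explode('/', $cidr) + [1 => '32'];
    $ipLong = ip2long($ip);
    $netLong = ip2long($network);
    $bits = (int) $bits;
    if ($ipLong === false || $netLong === false || $bits < 0 || $bits > 32) {
        return false;
    }
    // Build the network mask for the given prefix length.
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;
    return ($ipLong & $mask) === ($netLong & $mask);
}
```

A ban list would then be a plain array of CIDR strings checked against the visitor's address on each submission.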
    HEY! YOU! Read the New User Guide and Forum Rules

    "They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin

    "The greatest tragedy of this changing society is that people who never knew what it was like before will simply assume that this is the way things are supposed to be." -2600 Magazine, Fall 2002

    Think we're being rude? Maybe you asked a bad question or you're a Help Vampire. Trying to argue intelligently? Please read this.
  20. #11
  21. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Mar 2003
    Location
    Cleveland OH USA
    Posts
    44
    Rep Power
    16
    That's no discussion, that's you and Jacques yelling at some poor coder who came here asking an innocent code question.

    What people want to do with their web sites is their business. If someone wants to block visitors with accents in their names, that's their idiotic decision. I don't want to do that. I want to block spammers who for some reason have decided to use 8+ letter strings that are low on vowels, high on rare consonants like W, X, and Z, use consonant pairs and triplets that are never found in nature, and clearly exceed some threshold of unpronounceability which could probably be detected by a simple script that already exists. This is a major problem for me and it's not deserving of irrational scorn and moralizing. No one has named a single example of a real-world name that has the statistical or phonetic characteristics of the example strings I posted.
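The "consonant pairs that are never found in nature" idea could be sketched with letter pairs (bigrams) collected from a reference list of real names. The function names below are hypothetical. No such reference list is supplied in this thread, so one would have to be provided, and the result is only as good as that list.

```php
<?php
// Collect every adjacent letter pair (bigram) seen in a reference list of names.
function knownBigrams(array $referenceNames): array {
    $seen = [];
    foreach ($referenceNames as $name) {
        $letters = preg_replace('/[^a-z]/', '', strtolower($name));
        for ($i = 0; $i < strlen($letters) - 1; $i++) {
            $seen[substr($letters, $i, 2)] = true;
        }
    }
    return $seen;
}

// Fraction of a candidate's bigrams that never occur in the reference list.
// Returns 0.0 for strings with fewer than two letters.
function unseenBigramRatio(string $candidate, array $seen): float {
    $letters = preg_replace('/[^a-z]/', '', strtolower($candidate));
    $total = strlen($letters) - 1;
    if ($total < 1) {
        return 0.0;
    }
    $unseen = 0;
    for ($i = 0; $i < $total; $i++) {
        if (!isset($seen[substr($letters, $i, 2)])) {
            $unseen++;
        }
    }
    return $unseen / $total;
}
```

With a large enough reference list, a high ratio suggests a string built from unusual letter pairs. A legitimate name from a culture missing from the reference list would score high as well.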

    You seem to think I'm out to make things frustrating for some ridiculously minuscule population of people with names like Gdmkhbjj, Ifvuwnqs, and Qqqoznky. I'm not. I just want to block 99.99% of the spam my site gets without having to make human users jump through ridiculous hoops of registration, captchas or other human checks which are (a) annoying to have to submit to when solving them takes more work than posting the comment intended and (b) constantly being defeated by improved machine technologies anyway. This decision is a rational choice for me as a webmaster. I'd rather lose the approximately 0% of real users who attempt to list their names as Xinjsycr and Bbovitjp than the much higher percentage of users who want to make a quick, valid contribution but would give up when confronted with the hassle of a registration or captcha. In my particular application, the types of input that are being submitted consist only of a few clicks and keystrokes, and repeat submitters are rare. For most people, it's not worth a registration or human check to make such a small submission.

    Thanks anyway. I'll inquire elsewhere.
    Last edited by Robert K S; February 18th, 2014 at 08:05 AM.
  22. #12
  23. --
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Jul 2012
    Posts
    3,959
    Rep Power
    1018
    Originally Posted by Robert K S
    I'll inquire elsewhere.
    If you're looking for a code drone to blindly implement your idea without asking questions, you should probably do that.

    I don't think that's the kind of people I would ask for advice, though ...
  24. #13
  25. No Profile Picture
    Contributing User
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Sep 2006
    Posts
    2,122
    Rep Power
    539
    I am not endorsing or opposing Robert's idea.

    I do feel, however, that this should be a safe place to bring up and discuss novel ideas no matter how unorthodox they sound, and I don't think Robert got a fair shot on this one.

    I once read that when Larry Page and Sergey Brin came up with their concepts for Google, they were ridiculed almost to the point of tears. So, don't give up, Robert!

    PS. I probably wouldn't do it, but it is an interesting idea.

    Comments on this post

    • Jacques1 disagrees : Not every rejected idea makes you a Galileo Galilei. I think we've explained in great detail why we think his approach doesn't work.
  26. #14
  27. Sarcky
    Devshed Supreme Being (6500+ posts)

    Join Date
    Oct 2006
    Location
    Pennsylvania, USA
    Posts
    10,884
    Rep Power
    6356
    Originally Posted by Robert K S
    That's no discussion, that's you and Jacques yelling at some poor coder who came here asking an innocent code question.
    Nobody is yelling. You came in and asked how to do something that's a bad idea. If this were a car enthusiast forum and you asked how to remove your airbags and replace them with frozen yogurt tubs, we'd tell you not to do that.

    Originally Posted by Robert K S
    What people want to do with their web sites is their business. If someone wants to block visitors with accents in their names, that's their idiotic decision.
    Similarly, whether or not we point out idiotic decisions is our business. If you'd like to join a community where people only reply the way you like, go start your own community.


    Originally Posted by Robert K S
    I don't want to do that. I want to block spammers who for some reason have decided to use 8+ letter strings that are low on vowels, high on rare consonants like W, X, and Z, use consonant pairs and triplets that are never found in nature, and clearly exceed some threshold of unpronounceability which could probably be detected by a simple script that already exists. This is a major problem for me and it's not deserving of irrational scorn and moralizing. No one has named a single example of a real-world name that has the statistical or phonetic characteristics of the example strings I posted.
    Let's say you do this. Let's say you spend the next few weeks writing a statistical analyzer that determines the difference between "Sluzhynskyy" (a guy I work with) and "Xnxryxao". So now you can successfully, with no false positives, kick out spammers who are using random characters. The spammer updates their software to use the freely available most popular names in the US list, and you're back to square one.


    Originally Posted by Robert K S
    You seem to think I'm out to make things frustrating for some ridiculously minuscule population of people with names like Gdmkhbjj, Ifvuwnqs, and Qqqoznky. I'm not. I just want to block 99.99% of the spam my site gets without having to make human users jump through ridiculous hoops
    Without having to make human users with names you approve of, you mean. But you don't seem to care about that.

    Originally Posted by Robert K S
    registration, captchas or other human checks which are (a) annoying to have to submit to when solving them takes more work than posting the comment intended and (b) constantly being defeated by improved machine technologies anyway
    If "I'm going to lose anyway so why bother even trying" is already your motto, why bother writing your name analyzer?

    Originally Posted by Robert K S
    For most people, it's not worth a registration or human check to make such a small submission.

    Thanks anyway. I'll inquire elsewhere.
    You're welcome to stay here, but when you ask experts how to do something that's the wrong solution, expect them to tell you it's the wrong solution.

    If all you want is to use us as an analog to google: no, there is no such pre-built function which analyzes consonant sequences. However, if you really must ignore the existing standards and use this idea of yours to prevent spam, you can do something like this:

    PHP Code:
    function checkUser( $name ) {
      // Reject if 4 or more non-vowels appear in a row (y counts as a non-vowel here):
      if ( preg_match( '/[b-df-hj-np-tv-z]{4,}/i', $name ) ) return false;

      // Reject on a double occurrence of an "uncommon" letter:
      if ( preg_match( '/([qvwzx])\1/i', $name ) ) return false;

      // Reject if Q is not followed by U:
      if ( preg_match( '/q(?!u)/i', $name ) ) return false;

      // Passed all checks:
      return true;
    }

    Now unfortunately, given an input like this:
    PHP Code:
    $inputs = array(
      'Xnxryxao',
      'Sluzhynskyy',
      'Steve',
      'Ymsbbsbs',
      'Gpeaomkm',
      'Lizzy',
      'San Francisco',
      'Iraq',
    );
    // Print the verdict for each sample input:
    foreach ( $inputs as $name ) {
      echo $name . ' is ' . ( checkUser( $name ) ? 'VALID' : 'invalid' ) . "\n";
    }

    The output still shows that some things, including valid names, are being thrown out:
    Code:
    Xnxryxao is invalid
    Sluzhynskyy is invalid
    Steve is VALID
    Ymsbbsbs is invalid
    Gpeaomkm is VALID
    Lizzy is invalid
    San Francisco is VALID
    Iraq is invalid
    Now you at least have the starting point you were asking for. Good luck.
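One possible variation on that starting point, given the false positives shown in the output, is to count rule hits instead of rejecting on the first one and to let a manually maintained whitelist override the result. This is only a sketch: `suspicionScore`, `shouldFlag`, the whitelist and the threshold are all hypothetical names and parameters, and the three patterns are the same ones used in the function above.

```php
<?php
// Hypothetical variant: count how many of the three rules a name trips.
function suspicionScore(string $name): int {
    $rules = [
        '/[b-df-hj-np-tv-z]{4,}/i', // 4 or more non-vowels in a row
        '/([qvwzx])\1/i',           // doubled "uncommon" letter
        '/q(?!u)/i',                // Q not followed by U
    ];
    $score = 0;
    foreach ($rules as $rule) {
        if (preg_match($rule, $name)) {
            $score++;
        }
    }
    return $score;
}

// Flag a name only if it is not whitelisted and reaches the threshold.
function shouldFlag(string $name, array $whitelist, int $threshold = 1): bool {
    $known = array_map('strtolower', $whitelist);
    if (in_array(strtolower($name), $known, true)) {
        return false;
    }
    return suspicionScore($name) >= $threshold;
}
```

With the default threshold of 1 this flags the same inputs as the function above, minus anything on the whitelist. Raising the threshold lets names like 'Iraq' and 'Lizzy' through, but it also lets through spam strings that trip only one rule, such as 'Xnxryxao'.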
    Last edited by ManiacDan; February 19th, 2014 at 09:25 AM.
  28. #15
  29. Devshed Beginner (1000 - 1499 posts)

    Join Date
    Jan 2004
    Location
    New Springfield, OH
    Posts
    1,312
    Rep Power
    1503
    There is a fair bit of indignation coming from you if you're honest with yourself. You don't want to accept that the answer you seek is not the proper solution to your problem.
    Originally Posted by Robert K S
    What people want to do with their web sites is their business.
    Spot on, old chap! Stop making it ours if you won't listen to good reason.
    Originally Posted by Robert K S
    Thanks anyway. I'll inquire elsewhere.
    A fine idea. And when you get the answer you are looking for, consider NEVER visiting that site again because they will have given you very poor advice.

    Comments on this post

    • ManiacDan agrees : Somewhere there's an entire message board of people helping each other do terribly incorrect things.
    Don't like me? Click it.

    Scripting problems? Windows questions? Ask the Windows Guru!

    Stay up to date with all of my latest content. Follow me on Twitter!

    Help us help you! Post your exact error message with these easy tips!
