1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Jun 2013
    Rep Power

    Perplexing computer science problem -- how to handle it?

    This is a technical assessment question. The goal is to write a PHP program which reads three input files:

    subscribers.txt - a list of people subscribed to our newsletter
    unsubscribed.txt - a list of people who have unsubscribed
    bounced.txt - a list of addresses which have bounced in the past

    Each file is plain text and contains one email address per line.

    The program should consider the list of subscribers and remove anyone who has unsubscribed or bounced, since we don't want to send the newsletter to them. It should then output a list of the people who will receive the email.

    Now the key...
    How well will your program perform against a list of a million subscribers?

    My thoughts:
    1) To cross-check the lists, it seems we'll need to hold at least one file entirely in memory at some point.

    2) We could parse the subscribers.txt file one line at a time with fgets.

    What tricks could we use to take, say, 600 thousand subscribers and cross-check them against equally large lists of bounced and unsubscribed emails?
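
    The two thoughts above can be combined into a rough sketch: load the two exclusion lists into one array keyed by address, so that each lookup is an isset() call rather than a scan, and stream subscribers.txt with fgets. The function names loadEmailSet and filterSubscribers are made up for illustration, and lowercasing and trimming each address is an assumption about how the lists should be compared, not something the assignment states.

    ```php
    <?php
    // Hypothetical helper: read one address per line into an array used as a set.
    function loadEmailSet(string $path): array
    {
        $handle = fopen($path, 'r');
        if ($handle === false) {
            throw new RuntimeException("Cannot open $path");
        }
        $set = [];
        while (($line = fgets($handle)) !== false) {
            // Assumption: addresses are compared case-insensitively.
            $email = strtolower(trim($line));
            if ($email !== '') {
                $set[$email] = true;
            }
        }
        fclose($handle);
        return $set;
    }

    // Hypothetical helper: stream the subscribers file and write every address
    // that is not in $excluded to the stream $out. Returns the number written.
    function filterSubscribers(string $subscribersPath, array $excluded, $out): int
    {
        $in = fopen($subscribersPath, 'r');
        if ($in === false) {
            throw new RuntimeException("Cannot open $subscribersPath");
        }
        $kept = 0;
        while (($line = fgets($in)) !== false) {
            $email = strtolower(trim($line));
            if ($email !== '' && !isset($excluded[$email])) {
                fwrite($out, $email . "\n");
                $kept++;
            }
        }
        fclose($in);
        return $kept;
    }
    ```

    A caller would build the set with loadEmailSet('unsubscribed.txt') + loadEmailSet('bounced.txt') and pass it to filterSubscribers together with an output stream such as fopen('php://output', 'w'). Only the exclusion lists sit in memory; the subscribers file is never loaded whole.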
  2. #2
  3. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2012
    Rep Power
    This is not a very difficult assignment, but we do not do your work for you here. If you have a problem, tell us and show us your code so we can find it.

    My thoughts on how to do this: first, read the subscribers file, put all of its contents into an array, then close the file. Next, open the unsubscribed file, put its data into a second array, and close the file. Check for any data that matches between the two arrays, delete the matching data from array one, then unset array two. Third, do the same with the bounced file as you just did with the unsubscribed file. Fourth, create a new file with an appropriate name, write the remaining data of array one to it, close the new file, and unset array one. Then just display the data of the new file if needed.
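
    A minimal sketch of those steps, using file() to read each list into an array and array_diff() to drop the matches. The helper names readLines and recipients are made up for illustration, and trimming each line is an assumption about the input. This version holds whole files in memory, so its peak memory grows with the file sizes.

    ```php
    <?php
    // Hypothetical helper: read a file into an array, one address per element.
    function readLines(string $path): array
    {
        $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        if ($lines === false) {
            throw new RuntimeException("Cannot read $path");
        }
        // Assumption: stray whitespace (e.g. "\r" from Windows line endings) is not significant.
        return array_map('trim', $lines);
    }

    // Hypothetical helper: subscribers minus unsubscribed minus bounced.
    function recipients(string $subscribers, string $unsubscribed, string $bounced): array
    {
        $list = readLines($subscribers);
        // Each temporary exclusion array is released once array_diff() returns.
        $list = array_diff($list, readLines($unsubscribed));
        $list = array_diff($list, readLines($bounced));
        // array_diff() preserves keys, so reindex before returning.
        return array_values($list);
    }
    ```

    The result could then be written out with something like file_put_contents('recipients.txt', implode("\n", $result) . "\n"), where recipients.txt is just a placeholder name.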

    To speed up the process, you could work in chunks instead of on the whole files at once: run the process on, say, a quarter of each file and keep repeating until all of the files have been processed. Trying to open and parse a huge file in its entirety will take a while, but doing it a part at a time should go much faster.
    Last edited by jack13580; June 5th, 2013 at 01:33 AM.
  4. #3
  5. No Profile Picture
    Contributing User
    Devshed Loyal (3000 - 3499 posts)

    Join Date
    Dec 2004
    Rep Power
    Also note that whether you do it in chunks or in one go, PHP's execution time limit will stop the script if the file is very large.

    To solve this, run the script from the command line.
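
    For reference, PHP's max_execution_time setting defaults to 0 (no limit) when running from the command line, which is why this avoids the timeout. A small sketch of a guard that a script could call at startup; the function name liftTimeLimit is made up for illustration:

    ```php
    <?php
    // Hypothetical guard: make sure the script is not cut off by max_execution_time.
    function liftTimeLimit(): bool
    {
        if (PHP_SAPI === 'cli') {
            // The CLI defaults to no execution time limit, so nothing to do.
            return true;
        }
        // Under a web server, try to remove the limit; this can fail if the
        // function is disabled in the server configuration.
        return set_time_limit(0);
    }
    ```

    Assuming the script is saved as filter.php (a placeholder name), it could then be run as: php filter.php > recipients.txt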
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Aug 2012
    Rep Power
    The problem is not reading the huge file but parsing the data in it. PHP can easily read a 2.3GB file, but if you try to put all of the contents of that massive file into an array, Apache would probably crash.

    Here's a quote from a user note on the fread() page of the PHP manual:

    Originally Posted by matt at matt-darby dot com
    I thought I had an issue where fread() would fail on files > 30M in size. I tried a file_get_contents() method with the same results. The issue was not reading the file, but echoing its data back to the browser.

    Basically, you need to split up the filedata into manageable chunks before firing it off to the browser:

    PHP Code:

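    $total     = filesize($filepath); // total bytes to send
    $blocksize = (2 << 20); // 2M chunks
    $sent      = 0;
    $handle    = fopen($filepath, "r");
    // Now we need to loop through the file and echo out chunks of file data
    // Dumping the whole file fails at > 30M!
    while ($sent < $total) {
        echo fread($handle, $blocksize); // read and send one chunk
        $sent += $blocksize;
    }
    fclose($handle);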

    Hope this helps someone!
    Last edited by jack13580; June 5th, 2013 at 06:33 AM.
  8. #5
  9. Mad Scientist
    Devshed Expert (3500 - 3999 posts)

    Join Date
    Oct 2007
    North Yorkshire, UK
    Rep Power
    Performance in these situations is generally not the priority, as they are batch processes that do not get run that often.

    What is more important is data integrity.

    Without much more thought, I'd use a SQL based database; turn my lists into tables and run a NOT IN type query.
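
    A rough sketch of that idea using PDO with an in-memory SQLite database. SQLite, the table names, and the function name recipientsViaSql are all assumptions made for illustration (any SQL database would do), and the code assumes the pdo_sqlite extension is installed.

    ```php
    <?php
    // Hypothetical helper: load the three lists into tables and filter with NOT IN.
    function recipientsViaSql(array $subscribers, array $unsubscribed, array $bounced): array
    {
        $db = new PDO('sqlite::memory:');
        $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

        $lists = [
            'subscribers'  => $subscribers,
            'unsubscribed' => $unsubscribed,
            'bounced'      => $bounced,
        ];
        foreach ($lists as $table => $emails) {
            // Table names come from the fixed list above, never from user input.
            $db->exec("CREATE TABLE $table (email TEXT PRIMARY KEY)");
            $stmt = $db->prepare("INSERT OR IGNORE INTO $table (email) VALUES (?)");
            // One transaction per list keeps bulk inserts reasonably fast.
            $db->beginTransaction();
            foreach ($emails as $email) {
                $stmt->execute([$email]);
            }
            $db->commit();
        }

        $sql = 'SELECT email FROM subscribers
                WHERE email NOT IN (SELECT email FROM unsubscribed)
                  AND email NOT IN (SELECT email FROM bounced)
                ORDER BY email';
        return $db->query($sql)->fetchAll(PDO::FETCH_COLUMN);
    }
    ```

    The PRIMARY KEY on email also removes duplicate addresses within each list, which fits the point about data integrity.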

    Performance is much more important on repetitive tasks. For example, say you have a list of 1 million subscribers and you want to mail all of them, then track the opens, clicks, clicks on the unsubscribe link, etc. General behaviour patterns will give you a server load represented by a positively skewed distribution curve, and your server will have to cope with the peak demand in order to comply with legal regulations on honouring unsubscriptions.
