#1
  1. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    3
    Rep Power
    0

    Red face Help! how to calculate the similarity of two users in facebook?


    I am working on a project about data mining. my company has given me 6 million dummy customer info of facebook. I was assigned to find out the similarity between any two users. can anyone could give me some ideas how to deal with the large community data? Thanks in advance

    Problem : I use the status info & hashtag info(hashtags are those words highlighted by user) as the two criteria to measure the similarity between two different users. Since the large number of users, and especially there may be millions of hastags & statuses of each user. Can anyone tell me a good way to fast calculate the similarity between two users? I have tried to use FT-IDF to calculate the similarity between two different users, but it seems infeasible. can anyone have a very super algorithm or good ideas which could make me fast find all the similarities between users?

    For example:
    user A's hashtag = {cat, bull, cow, chicken, duck}
    user B's hashtag ={cat, chicken, cloth}
    user C's hashtag = {lenovo, Hp, Sony}

    clearly, C has no relation with A, so it is not necessary to calculate the similarity to waste time, we may filter out all those unrelated user first before calculate the similarity. in fact, more than 90% of the total users are unrelated with a particular user. How to use hashtag as criteria to fast find those potential similar user group of A? is this a good idea? or we just directly calculate the relative similarity between A and all other users? what algorithm would be the fastest and customized algorithm for the problem?
  2. #2
  3. I <3 ASCII
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Aug 2003
    Posts
    2,400
    Rep Power
    1233
    If I were doing this I would probably turn this project into a two pass process:

    Create a hashtag map identifying related hashtags into categories

    I'd start by identifying general categories within the data. If a user tagged 'cat' what percentage of the time did they also use the hashtag 'cow'.

    At a certain point you'll hopefully be able to group hashtags into a category: with the hashtags you supplied you'd end up with a farming tag group and a technology tag group.

    Depending on how much time/computing power/raw manpower you have you can either make a fairly naive list, or something that identifies nuanced groups (pro vs college football, apple vs pc vs android, animal farming vs crop farming).

    As this list becomes more and more refined you can run the next step as necessary to identify connections.

    Create a score for how interested someone is in an interest based on their hashtags

    Keep in mind that the tag '#SEC' means something completely different based on whether the person in into football or investment banking. By looking at their tags as a whole you can get a pretty good idea of their top few interests. From here you can create either a score or a likelyhood for each category. From here it should be trivial to identify users with the same core interests.

    Once your mapping has been created (not a simple task I know) analysis of individual users should be fairly quick to do. Additionally having a hashtag map will allow for people with different hashtags in the same category to be joined. (ie cat vs kitten, phone vs mobile, cattle vs livestock).

    -MBirchmeier
    P.S. I don't know if I really answered your question, but this topic seems interesting, so I mulled it over how I'd do it for a bit.
  4. #3
  5. No Profile Picture
    Registered User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Nov 2012
    Posts
    3
    Rep Power
    0
    hi, MBirchmeie Thank you very much for your good suggestions.

    I will try to understand your way how to implement it.

    In fact, the idea is there, the only thing need to do is to modify the data structure to make it feasible to implement with this kind of huge data, my main focus here is the time complexity and space complexity concern. The main purpose here is to find the related similarity between all the users, but in real life(in facebook), we may know only a very very small portion, it means there are actually a very very small group who have somewhat similarity with a particular person, if I directly calculate the similarity between each user, it will go through a lot of unnecessary similarity calculation, which waste a lot of time. So that I thought of filtering out all those unrelated users first, and find out that so-called related group first. The final step is to calculate the similarity between a particular person with his related group only, this way may be much faster than directly calculating the similarity. Use normal data structure:

    Step 1: use an array to store the 6 million(6m)users
    use link list to chain all the related users to a particular user

    Problem: Search time complexity is huge. Time complexity = O(6m*6m*K)
    K is the number of distinct hashtags a particular user has.

    1. select a particular user, search all other users(6m)
    2. there are 6 million users (6m)
    3. each user may have a large amount of distinct hashtags (K)

    From the analysis above, using normal data structure is infeasible. I will try to figure out if your way is better to implement, if so, I will use your idea to implement it. Thank you very much again.


    Originally Posted by MBirchmeier
    If I were doing this I would probably turn this project into a two pass process:

    Create a hashtag map identifying related hashtags into categories

    I'd start by identifying general categories within the data. If a user tagged 'cat' what percentage of the time did they also use the hashtag 'cow'.

    At a certain point you'll hopefully be able to group hashtags into a category: with the hashtags you supplied you'd end up with a farming tag group and a technology tag group.

    Depending on how much time/computing power/raw manpower you have you can either make a fairly naive list, or something that identifies nuanced groups (pro vs college football, apple vs pc vs android, animal farming vs crop farming).

    As this list becomes more and more refined you can run the next step as necessary to identify connections.

    Create a score for how interested someone is in an interest based on their hashtags

    Keep in mind that the tag '#SEC' means something completely different based on whether the person in into football or investment banking. By looking at their tags as a whole you can get a pretty good idea of their top few interests. From here you can create either a score or a likelyhood for each category. From here it should be trivial to identify users with the same core interests.

    Once your mapping has been created (not a simple task I know) analysis of individual users should be fairly quick to do. Additionally having a hashtag map will allow for people with different hashtags in the same category to be joined. (ie cat vs kitten, phone vs mobile, cattle vs livestock).

    -MBirchmeier
    P.S. I don't know if I really answered your question, but this topic seems interesting, so I mulled it over how I'd do it for a bit.
  6. #4
  7. No Profile Picture
    Contributing User
    Devshed Newbie (0 - 499 posts)

    Join Date
    Oct 2012
    Location
    Croatia
    Posts
    65
    Rep Power
    2
    Where your firm get datas for that people?
  8. #5
  9. I <3 ASCII
    Devshed Regular (2000 - 2499 posts)

    Join Date
    Aug 2003
    Posts
    2,400
    Rep Power
    1233
    The O(n) of the determining the interest groups is likely to be your biggest issue.

    I'd guess you're looking at around n^3, based on the number of unique hashtags in the system.

    Once you have your interest groups though you're looking at a much easier task. Lets say you've got 100 interest groups identified You're looking at O(n) (6m users * 100 scores) for scoring all of the users.

    After that you're looking at an O(n^2) for matching users, but the n is much smaller. First you can remove all users below a certain score level, say a user has a 5% chance of being in a certain interest group, you're looking at (300k(users in group) * 300k * 100(number of groups).

    Even this is a pessimistic result IMO. Since you have a score you can sort the users in a group (n log n) then just compare them to the closest X in either direction.

    -MBirchmeier

IMN logo majestic logo threadwatch logo seochat tools logo