Software Design
 
Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
User Name:
Password:
Remember me

The Shed is going Social! Join us on FaceBook and Twitter and chime in on the conversation.

Go Back   Dev Shed ForumsProgramming Languages - MoreSoftware Design

Reply
Add This Thread To:
  Del.icio.us   Digg   Google   Spurl   Blink   Furl   Simpy   Y! MyWeb 
Thread Tools Search this Thread Rate Thread Display Modes
 
Unread Dev Shed Forums Sponsor:
  #1  
Old December 2nd, 2012, 07:53 PM
ldaneil ldaneil is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Nov 2012
Posts: 3 ldaneil User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 25 m 11 sec
Reputation Power: 0
Red face Help! how to calculate the similarity of two users in facebook?

I am working on a project about data mining. my company has given me 6 million dummy customer info of facebook. I was assigned to find out the similarity between any two users. can anyone could give me some ideas how to deal with the large community data? Thanks in advance

Problem : I use the status info & hashtag info(hashtags are those words highlighted by user) as the two criteria to measure the similarity between two different users. Since the large number of users, and especially there may be millions of hastags & statuses of each user. Can anyone tell me a good way to fast calculate the similarity between two users? I have tried to use FT-IDF to calculate the similarity between two different users, but it seems infeasible. can anyone have a very super algorithm or good ideas which could make me fast find all the similarities between users?

For example:
user A's hashtag = {cat, bull, cow, chicken, duck}
user B's hashtag ={cat, chicken, cloth}
user C's hashtag = {lenovo, Hp, Sony}

clearly, C has no relation with A, so it is not necessary to calculate the similarity to waste time, we may filter out all those unrelated user first before calculate the similarity. in fact, more than 90% of the total users are unrelated with a particular user. How to use hashtag as criteria to fast find those potential similar user group of A? is this a good idea? or we just directly calculate the relative similarity between A and all other users? what algorithm would be the fastest and customized algorithm for the problem?

Reply With Quote
  #2  
Old December 3rd, 2012, 03:28 PM
MBirchmeier's Avatar
MBirchmeier MBirchmeier is offline
I <3 ASCII
Dev Shed Regular (2000 - 2499 posts)
 
Join Date: Aug 2003
Posts: 2,395 MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level) 
Time spent in forums: 1 Month 2 Weeks 2 Days 18 h 33 m 9 sec
Reputation Power: 1231
Send a message via AIM to MBirchmeier
If I were doing this I would probably turn this project into a two pass process:

Create a hashtag map identifying related hashtags into categories

I'd start by identifying general categories within the data. If a user tagged 'cat' what percentage of the time did they also use the hashtag 'cow'.

At a certain point you'll hopefully be able to group hashtags into a category: with the hashtags you supplied you'd end up with a farming tag group and a technology tag group.

Depending on how much time/computing power/raw manpower you have you can either make a fairly naive list, or something that identifies nuanced groups (pro vs college football, apple vs pc vs android, animal farming vs crop farming).

As this list becomes more and more refined you can run the next step as necessary to identify connections.

Create a score for how interested someone is in an interest based on their hashtags

Keep in mind that the tag '#SEC' means something completely different based on whether the person in into football or investment banking. By looking at their tags as a whole you can get a pretty good idea of their top few interests. From here you can create either a score or a likelyhood for each category. From here it should be trivial to identify users with the same core interests.

Once your mapping has been created (not a simple task I know) analysis of individual users should be fairly quick to do. Additionally having a hashtag map will allow for people with different hashtags in the same category to be joined. (ie cat vs kitten, phone vs mobile, cattle vs livestock).

-MBirchmeier
P.S. I don't know if I really answered your question, but this topic seems interesting, so I mulled it over how I'd do it for a bit.
__________________
My fiancee's transition from accountant to writer
0x4279 7465 204D 6521

Reply With Quote
  #3  
Old December 3rd, 2012, 09:07 PM
ldaneil ldaneil is offline
Registered User
Dev Shed Newbie (0 - 499 posts)
 
Join Date: Nov 2012
Posts: 3 ldaneil User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 25 m 11 sec
Reputation Power: 0
hi, MBirchmeie Thank you very much for your good suggestions.

I will try to understand your way how to implement it.

In fact, the idea is there, the only thing need to do is to modify the data structure to make it feasible to implement with this kind of huge data, my main focus here is the time complexity and space complexity concern. The main purpose here is to find the related similarity between all the users, but in real life(in facebook), we may know only a very very small portion, it means there are actually a very very small group who have somewhat similarity with a particular person, if I directly calculate the similarity between each user, it will go through a lot of unnecessary similarity calculation, which waste a lot of time. So that I thought of filtering out all those unrelated users first, and find out that so-called related group first. The final step is to calculate the similarity between a particular person with his related group only, this way may be much faster than directly calculating the similarity. Use normal data structure:

Step 1: use an array to store the 6 million(6m)users
use link list to chain all the related users to a particular user

Problem: Search time complexity is huge. Time complexity = O(6m*6m*K)
K is the number of distinct hashtags a particular user has.

1. select a particular user, search all other users(6m)
2. there are 6 million users (6m)
3. each user may have a large amount of distinct hashtags (K)

From the analysis above, using normal data structure is infeasible. I will try to figure out if your way is better to implement, if so, I will use your idea to implement it. Thank you very much again.


Quote:
Originally Posted by MBirchmeier
If I were doing this I would probably turn this project into a two pass process:

Create a hashtag map identifying related hashtags into categories

I'd start by identifying general categories within the data. If a user tagged 'cat' what percentage of the time did they also use the hashtag 'cow'.

At a certain point you'll hopefully be able to group hashtags into a category: with the hashtags you supplied you'd end up with a farming tag group and a technology tag group.

Depending on how much time/computing power/raw manpower you have you can either make a fairly naive list, or something that identifies nuanced groups (pro vs college football, apple vs pc vs android, animal farming vs crop farming).

As this list becomes more and more refined you can run the next step as necessary to identify connections.

Create a score for how interested someone is in an interest based on their hashtags

Keep in mind that the tag '#SEC' means something completely different based on whether the person in into football or investment banking. By looking at their tags as a whole you can get a pretty good idea of their top few interests. From here you can create either a score or a likelyhood for each category. From here it should be trivial to identify users with the same core interests.

Once your mapping has been created (not a simple task I know) analysis of individual users should be fairly quick to do. Additionally having a hashtag map will allow for people with different hashtags in the same category to be joined. (ie cat vs kitten, phone vs mobile, cattle vs livestock).

-MBirchmeier
P.S. I don't know if I really answered your question, but this topic seems interesting, so I mulled it over how I'd do it for a bit.

Reply With Quote
  #4  
Old December 4th, 2012, 04:00 AM
Robotower Robotower is offline
Contributing User
Click here for more information
 
Join Date: Oct 2012
Location: Croatia
Posts: 66 Robotower User rank is Just a Lowly Private (1 - 20 Reputation Level) 
Time spent in forums: 9 h 58 m 54 sec
Reputation Power: 1
Where your firm get datas for that people?

Reply With Quote
  #5  
Old December 7th, 2012, 10:20 AM
MBirchmeier's Avatar
MBirchmeier MBirchmeier is offline
I <3 ASCII
Dev Shed Regular (2000 - 2499 posts)
 
Join Date: Aug 2003
Posts: 2,395 MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level)MBirchmeier User rank is General 4th Grade (Above 100000 Reputation Level) 
Time spent in forums: 1 Month 2 Weeks 2 Days 18 h 33 m 9 sec
Reputation Power: 1231
Send a message via AIM to MBirchmeier
The O(n) of the determining the interest groups is likely to be your biggest issue.

I'd guess you're looking at around n^3, based on the number of unique hashtags in the system.

Once you have your interest groups though you're looking at a much easier task. Lets say you've got 100 interest groups identified You're looking at O(n) (6m users * 100 scores) for scoring all of the users.

After that you're looking at an O(n^2) for matching users, but the n is much smaller. First you can remove all users below a certain score level, say a user has a 5% chance of being in a certain interest group, you're looking at (300k(users in group) * 300k * 100(number of groups).

Even this is a pessimistic result IMO. Since you have a score you can sort the users in a group (n log n) then just compare them to the closest X in either direction.

-MBirchmeier

Reply With Quote
Reply

Viewing: Dev Shed ForumsProgramming Languages - MoreSoftware Design > Help! how to calculate the similarity of two users in facebook?

Developer Shed Advertisers and Affiliates



Thread Tools  Search this Thread 
Search this Thread:

Advanced Search
Display Modes  Rate This Thread 
Rate This Thread:


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
View Your Warnings | New Posts | Latest News | Latest Threads | Shoutbox
Forum Jump

Forums: » Register « |  User CP |  Games |  Calendar |  Members |  FAQs |  Sitemap |  Support | 
  
 


Powered by: vBulletin Version 3.0.5
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.

© 2003-2013 by Developer Shed. All rights reserved. DS Cluster - Follow our Sitemap