#1
    Registered User

    Finding several consecutive sessions with the same users


    Hello,
    I have a log of user sessions. Each session is identified by a pair of unique user IDs. Below is a sample of the log:

    roomID,leaderID,start-time,finish-time,userA,userB
    20063,538,15:00:02,15:00:14, 1631463751, 1782130655
    20458,478,15:03:24,15:04:21, 1631463751, 1782130655
    20518,478,14:47:24,14:48:37, 1788114654, 1756771250
    20528,478,14:43:35,15:07:40, 15731449767,17655415066
    20528,478,14:44:56,14:46:29, 1632300508, 1522909598
    20528,478,14:51:30,14:56:17, 1522909598, 1632300508
    20528,478,14:52:44,14:55:21, 1633889820, 687191220


    I marked the user IDs that had consecutive sessions (no sessions with other users in between). The trickiest part is that either user A or user B can initiate a session, and I need to find all of them.

    What do I need?
    I need to retrieve those lines where a pair of users established several (n > 1) consecutive sessions with each other and did not communicate with any other user between these sessions. There is also a time limit: the gap between the finish-time of the previous session and the start-time of the following one must not exceed 75 seconds.

    I would be very thankful to you for any help!
#2
    Contributing User
    I don't understand what you want to extract.

    Can you please show what you want to extract on your example, explaining why.
#3
    Registered User
    Originally Posted by Laurent_R
    I don't understand what you want to extract.

    Can you please show what you want to extract on your example, explaining why.
    20063,538,15:00:02,15:00:14, 1631463751, 1782130655
    20458,478,15:03:24,15:04:21, 1631463751, 1782130655

    20518,478,14:47:24,14:48:37, 1788114654, 1756771250
    20528,478,14:43:35,15:07:40, 15731449767,17655415066
    20528,478,14:44:56,14:46:29, 1632300508, 1522909598
    20528,478,14:51:30,14:56:17, 1522909598, 1632300508

    20528,478,14:52:44,14:55:21, 1633889820, 687191220
    20528,478,14:44:57,14:46:11, 1632300508, 1522909598   (first session between users A and B)
    20528,478,14:46:30,14:56:17, 1632300508, 1752301123   (user A opens a session with a different user)
    20528,478,14:56:44,14:58:21, 1632300508, 1522909598   (second session between users A and B)


    Basically, I want to extract the green lines from the file: these are sessions between two users who opened and then reopened a session within several seconds.
    In contrast, the three lines at the bottom show sessions that are not consecutive, because user 1632300508 called user 1752301123 before reopening the session.
#4
    Contributing User
    OK, a couple of questions.

    Suppose you have this in your file:

    20063,538,15:00:02,15:00:14, 1631463751, 1782130655
    20518,478,14:47:24,14:48:37, 1788114654, 1756771250
    20458,478,15:03:24,15:04:21, 1631463751, 1782130655

    I would assume that a session between two other users, appearing in the file between the two sessions of users 1631463751 and 1782130655, does not prevent you from picking up the sessions between those two users. Is this correct?

    Second question: how big is your file? The underlying reason for this question is whether it can fit in memory (say, in a hash). If yes, I would probably build a hash of arrays in which the key is one user and the array contains all the sessions of that user, sorted in chronological order. It would then be easy to extract what you need. If it is too big to fit in memory, you can probably still do it by iterating over the file, but the algorithm might be more complex.
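
    The hash-of-arrays idea described above could look roughly like the sketch below. It is only an illustration under a few assumptions: build_index is a hypothetical helper name, each session is stored under both of its participants, and the times are zero-padded HH:MM:SS strings from the same day (so they sort correctly as plain strings).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: index every session under BOTH participants,
# then sort each user's list by start time.
sub build_index {
    my @lines = @_;
    my %sessions_of;
    for my $line (@lines) {
        (my $clean = $line) =~ s/\s+//g;    # drop spaces and line endings
        next unless length $clean;
        my (undef, undef, $start, $end, $user_a, $user_b) = split /,/, $clean;
        push @{ $sessions_of{$user_a} }, [$start, $end, $user_b];
        push @{ $sessions_of{$user_b} }, [$start, $end, $user_a];
    }
    for my $user (keys %sessions_of) {
        # zero-padded HH:MM:SS strings compare correctly with cmp
        @{ $sessions_of{$user} } =
            sort { $a->[0] cmp $b->[0] } @{ $sessions_of{$user} };
    }
    return \%sessions_of;
}

my $index = build_index(
    '20528,478,14:51:30,14:56:17, 1522909598, 1632300508',
    '20528,478,14:44:56,14:46:29, 1632300508, 1522909598',
);
# Print the sessions of one user in chronological order.
for my $session (@{ $index->{1632300508} }) {
    print join(' ', @$session), "\n";
}
```

    Storing each session twice costs some memory, but for a file of a few megabytes that should not matter, and it lets you look at the timeline of either participant.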
#5
    Registered User
    Yes, correct: if the sessions between the two users happen to be consecutive, they have to be captured. Roughly speaking, I just need to find call-backs.
    My file is about 12 MB in size.
#6
    Contributing User
    Hi,

    I tried to prepare a quick solution, but it took me longer than I thought, and I do not have time to complete it right now. It is now later than half past midnight in my time zone, and I really have to go to sleep if I want to get to work on time tomorrow. I am giving you what I have now, because I will not be able to work on it for probably 20 hours, perhaps longer.

    What I have done is read the file, store the data in a hash of arrays of arrays, and sort the arrays by start time per user, so that, in principle, you only need to read the data structure of each user to figure out whether that user opened a second connection with the same user immediately after the first connection. I think it should get you relatively close to a solution, but a few things are still not working properly. I am not sure, after all, whether defining a main_user and a second_user is the best solution.

    Well, this is really not complete, but I hope it will help you get going. The code is quite short, but pretty dense, and it does quite a bit.

    Perl Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
     
    my %user;
     
    while (<DATA>) {
        chomp;
        s/\r//g;  # remove Windows carriage returns, if any
        s/\s+//g; # remove all white space
        my (undef, undef, $start_time, $end_time, @user_list) = split /,/;
        # main user: the ID that sorts first as a string, so that A,B and B,A share one key
        my ($main_user, $second_user) = sort @user_list;
        push @{$user{$main_user}}, [$start_time, $end_time, $second_user];
    }
     
    # sort each main user's sessions by start time (HH:MM:SS strings compare correctly)
    foreach (keys %user) {
        my @temp_array = @{$user{$_}};
        @{$user{$_}} = sort {$a->[0] cmp $b->[0]} @temp_array;
    }
     
    __DATA__
    20063,538,15:00:02,15:00:14, 1631463751, 1782130655
    20458,478,15:03:24,15:04:21, 1631463751, 1782130655
    20518,478,14:47:24,14:48:37, 1788114654, 1756771250
    20528,478,14:43:35,15:07:40, 15731449767,17655415066
    20528,478,14:44:56,14:46:29, 1632300508, 1522909598
    20528,478,14:51:30,14:56:17, 1522909598, 1632300508
    20528,478,14:52:44,14:55:21, 1633889820, 687191220


    At this point, you have the following data structure:

    Code:
    0  1633889820
    1  ARRAY(0x8035c9b0)
       0  ARRAY(0x8035c9c8)
          0  '14:52:44'
          1  '14:55:21'
          2  687191220
    2  1631463751
    3  ARRAY(0x8006c0e0)
       0  ARRAY(0x8006c110)
          0  '15:00:02'
          1  '15:00:14'
          2  1782130655
       1  ARRAY(0x8006c0b0)
          0  '15:03:24'
          1  '15:04:21'
          2  1782130655
    4  15731449767
    5  ARRAY(0x80368a88)
       0  ARRAY(0x8032ce98)
          0  '14:43:35'
          1  '15:07:40'
          2  17655415066
    6  1756771250
    7  ARRAY(0x80304b98)
       0  ARRAY(0x80304bf8)
          0  '14:47:24'
          1  '14:48:37'
          2  1788114654
    8  1522909598
    9  ARRAY(0x803689e0)
       0  ARRAY(0x803689b0)
          0  '14:44:56'
          1  '14:46:29'
          2  1632300508
       1  ARRAY(0x80368878)
          0  '14:51:30'
          1  '14:56:17'
          2  1632300508
    You just need to read it and add the time-difference check. Actually, I would probably convert the times to timestamps upfront, when reading the file, changing for example 12:23:42 to 12 * 60 * 60 + 23 * 60 + 42, so that time comparisons become simple numeric comparisons.
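
    As a small worked example of that conversion, here is a sketch of the arithmetic. The helper names to_seconds and gap_seconds are made up for illustration, and the sketch assumes that both times fall on the same day (no session crosses midnight).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert an HH:MM:SS string into seconds since midnight.
sub to_seconds {
    my ($h, $m, $s) = split /:/, shift;
    return 3600 * $h + 60 * $m + $s;
}

# Gap in seconds between the end of one session and the start of the next.
sub gap_seconds {
    my ($previous_end, $next_start) = @_;
    return to_seconds($next_start) - to_seconds($previous_end);
}

print to_seconds('12:23:42'), "\n";               # 44622
print gap_seconds('15:00:14', '15:03:24'), "\n";  # 190, above the 75-second limit
```

    Once the times are plain numbers, the 75-second condition is a single numeric comparison on the value returned by gap_seconds.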

    Please don't hesitate to ask if you need further information.
#7
    Contributing User
    Hi,

    I have a full script now:

    Perl Code:
    #!/usr/bin/perl
     
    use strict;
    use warnings;
     
    my %user;
     
    while (<DATA>) {
        chomp;
        s/\r//g;  # remove Windows carriage returns, if any
        s/\s+//g; # remove all white space
        my (undef, undef, $start_time, $end_time, @user_list) = split /,/;
        $start_time = make_timestamp ($start_time);
        $end_time = make_timestamp ($end_time);
        # main user: the ID that sorts first as a string, so that A,B and B,A share one key
        my ($main_user, $second_user) = sort @user_list;
        push @{$user{$main_user}}, [$start_time, $end_time, $second_user];
    }
     
    # for each main user, sort the sessions by start time and compare neighbours
    foreach (keys %user) {
        my @temp_array = sort {$a->[0] <=> $b->[0]} @{$user{$_}};
        my ($previous_second_user, $previous_end) = (0, 0);
        foreach my $array_ref (@temp_array) {
            # each entry is [start_time, end_time, second_user]
            my ($current_start, $current_end, $current_second_user) = @$array_ref;
            # same partner as in the previous session, and a gap of at most 75 seconds
            if ($current_second_user eq $previous_second_user
                and $current_start - $previous_end <= 75) {
                print "got a match: print whatever you want:$_ @$array_ref \n";
            }
            $previous_end = $current_end;
            $previous_second_user = $current_second_user;
        }
    }
     
    sub make_timestamp {
        my ($h, $m, $s) = split /:/, shift;
        return 3600 * $h + 60 * $m + $s;
    }
    __DATA__
    20063,538,15:00:02,15:00:14, 1631463751, 1782130655
    20458,478,15:01:24,15:04:21, 1631463751, 1782130655
    20518,478,14:47:24,14:48:37, 1788114654, 1756771250
    20528,478,14:43:35,15:07:40, 15731449767,17655415066
    20528,478,14:44:56,14:46:29, 1632300508, 1522909598
    20528,478,14:51:30,14:56:17, 1522909598, 1632300508
    20528,478,14:52:44,14:55:21, 1633889820, 687191220


    On the data above, I get the following output:

    Code:
    $ perl  connections.pl
    got a match: print whatever you want:1631463751 54084 54261 1782130655
    Only the pair 1631463751 / 1782130655 is reported: its second session starts 70 seconds after the first one ends. The two sessions between 1522909598 and 1632300508 are 301 seconds apart (14:46:29 to 14:51:30), which exceeds the 75-second limit.

    There is, however, one edge case that is not covered, because you did not say what should happen in that case. Suppose userA and userB have two connections, the first one finishing at 15:00:00 and the next one starting at 15:01:00. userA has no other connection in between, so the second connection qualifies from the standpoint of userA. But what if userB has another short connection with someone else in between?

    The code above does not try to handle this, since I do not know what the program should do in such a case.

    I leave it to you to solve this issue.
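
    One possible way to handle this edge case is sketched below. It assumes that a call-back should count only if neither of the two users had a session with anyone else in between, which is an interpretation of the requirement and not something confirmed in this thread. The helper names find_callbacks and to_seconds are hypothetical, and "in between" is judged by start time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub to_seconds {
    my ($h, $m, $s) = split /:/, shift;
    return 3600 * $h + 60 * $m + $s;
}

# Hypothetical helper: return the lines that re-open a session between the
# same two users, where NEITHER user had a session with anyone else in
# between and the gap is at most $max_gap seconds.
sub find_callbacks {
    my ($max_gap, @lines) = @_;
    my @sessions;
    for my $line (@lines) {
        (my $clean = $line) =~ s/\s+//g;
        my (undef, undef, $start, $end, $user_a, $user_b) = split /,/, $clean;
        push @sessions, {
            line  => $line,
            start => to_seconds($start),
            end   => to_seconds($end),
            users => [$user_a, $user_b],
        };
    }
    my (%last, @found);    # %last: most recent session seen for each user
    for my $s (sort { $a->{start} <=> $b->{start} } @sessions) {
        my ($user_a, $user_b) = @{ $s->{users} };
        my ($prev_a, $prev_b) = ($last{$user_a}, $last{$user_b});
        # Both users' previous session must be one and the same session
        # (same hash reference), which means it was between these two users.
        if ($prev_a && $prev_b && $prev_a == $prev_b
            && $s->{start} - $prev_a->{end} <= $max_gap) {
            push @found, $s->{line};
        }
        $last{$user_a} = $last{$user_b} = $s;
    }
    return @found;
}

# Made-up data: user 222 talks to 333 between the two 111/222 sessions,
# so nothing is reported here.
my @hits = find_callbacks(75,
    'r1,l1,14:59:00,15:00:00, 111, 222',
    'r1,l1,15:00:10,15:00:40, 222, 333',
    'r1,l1,15:01:00,15:02:00, 111, 222',
);
print scalar(@hits), " call-back(s) found\n";
```

    Because all sessions are processed in one global pass ordered by start time, the check looks at the timeline of both participants at once, instead of only at the user chosen as the hash key.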
#8
    Registered User
    Thanks a lot, Laurent, for your support! It was invaluable.
    Have a good day!
