Perl Programming
 
#1
September 11th, 2006, 03:24 PM
sushi23
Search PDF's and return Match w/ PDF file link

I've been looking around and I'm not sure how to do this.
I have hundreds of PDF files that are inconsistent in format. I need to build a search engine that searches the full text of every PDF file and returns the matches along with a link to each matching PDF.

Any particular modules I will need?
Actually, if possible, I'd rather do this with text files. I can convert all my PDFs to text files if that makes things easier. I have Word documents as well.
It is a lot of trouble installing new modules.

Thanks.

So far I have this which only reads a particular file and outputs it.

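use strict;
use warnings;

# Open the file for reading and report the reason if that fails.
open(my $fh, '<', 'textfile.txt') or die "Cannot open textfile.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    print "$line\n";
}
close($fh);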

Last edited by sushi23 : September 12th, 2006 at 06:40 PM.

#2
September 11th, 2006, 07:07 PM
sushi23
help

#3
September 11th, 2006, 07:38 PM
Axweildr
http://www.perlfect.com <--answer

#4
September 11th, 2006, 08:09 PM
sushi23
I would pay them, but I don't have much time... oh well... I'll keep searching.

#5
September 11th, 2006, 08:18 PM
Axweildr
http://www.perlfect.com/freescripts/ Last I heard it still indexed PDFs, even the free version ...

#6
September 12th, 2006, 05:19 PM
sushi23
I already have an interface that I built myself, which searches against a database. I need it to read a directory of PDFs and match results in them. I am very much a beginner at this. Any sample code or direction would help.

#7
September 12th, 2006, 06:55 PM
Axweildr
have a look at the indexer.pl script, and search for the PDF functionality

#8
September 12th, 2006, 09:07 PM
sushi23
wow that is extremely complicated.

#9
September 12th, 2006, 09:43 PM
Axweildr
Turns out they use pdftotext anyhow.

It also shells out to antiword for Word documents.
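
A minimal sketch of that approach, assuming the pdftotext and antiword command-line tools are installed and on the PATH. The function names extract_command and extract_text are made up for illustration and are not taken from the Perlfect indexer. The list-form pipe open used here is also an assumption about the platform (it works on Unix-like systems; older Perls on Windows do not support it).

```perl
use strict;
use warnings;

# Build the external command that extracts plain text from one document.
# Returns an empty list for file types this sketch does not handle.
sub extract_command {
    my ($path) = @_;
    if ($path =~ /\.pdf\z/i) {
        # "-" tells pdftotext to write the text to standard output.
        return ('pdftotext', $path, '-');
    }
    if ($path =~ /\.doc\z/i) {
        # antiword prints the document text to standard output by default.
        return ('antiword', $path);
    }
    return ();
}

# Run the converter and return the extracted text, or undef on failure.
sub extract_text {
    my ($path) = @_;
    my @cmd = extract_command($path) or return;
    # List-form pipe open avoids passing the file name through a shell.
    open(my $out, '-|', @cmd) or return;
    local $/;                      # slurp all of the output at once
    my $text = <$out>;
    close($out) or return;         # a non-zero exit status counts as failure
    return $text;
}
```

Calling extract_text('manual.pdf') would then hand back the text of that (hypothetical) file, ready to be searched or saved as a .txt file, with no extra Perl modules to install.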

Last edited by Axweildr : September 12th, 2006 at 09:46 PM.

#10
September 13th, 2006, 05:53 PM
sushi23
Good news. I got the script to read a directory of files and print all their contents. Small steps at a time.

Now I need to create a search.
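
For reference, a rough sketch of what such a search could look like, assuming each PDF has already been converted to a .txt file with the same base name in one directory. The function name search_dir, the directory name, and the base URL are placeholders, not anything from the actual script.

```perl
use strict;
use warnings;

# Search every .txt file in $dir for $term (case-insensitive, literal match).
# Returns one hash per hit: the matching line, its number, and the PDF link.
sub search_dir {
    my ($dir, $term, $base_url) = @_;
    my @hits;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    for my $file (sort grep { /\.txt\z/i } readdir $dh) {
        open(my $fh, '<', "$dir/$file") or next;
        # Assumption: report.txt was converted from report.pdf.
        (my $pdf = $file) =~ s/\.txt\z/.pdf/i;
        while (my $line = <$fh>) {
            chomp $line;
            # \Q...\E treats the search term as literal text, not a regex.
            next unless $line =~ /\Q$term\E/i;
            push @hits, { link => "$base_url/$pdf", line_no => $., text => $line };
        }
        close($fh);
    }
    closedir($dh);
    return @hits;
}

# Example use (directory and URL are placeholders):
# print "$_->{link} (line $_->{line_no}): $_->{text}\n"
#     for search_dir('converted', 'invoice', 'http://example.com/pdfs');
```

This rescans every file on each query, which is fine for a few hundred documents but slows down as the collection grows.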

#11
September 13th, 2006, 08:23 PM
Axweildr
ah, the easy part

what ideas you got?

#12
September 14th, 2006, 02:05 PM
sushi23
Hmm... now I'm thinking about reading the text file into an array and splitting it on spaces or tabs. I have tried both, but the output is the same: when I print element [0] or [1], I get an entire sentence instead of a single word. I have to resolve this.
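
One likely cause, offered as a guess since the script itself is not shown: reading a file into an array gives one element per line, so each line still has to be split into words. A small sketch (the helper name words_in is made up for illustration):

```perl
use strict;
use warnings;

# Split a chunk of text into lowercase words.
sub words_in {
    my ($text) = @_;
    # split ' ' splits on any run of whitespace (spaces, tabs, newlines)
    # and ignores leading whitespace.
    my @words = split ' ', lc $text;
    # Trim punctuation from both ends of each word, then drop empty leftovers.
    s/^\W+|\W+$//g for @words;
    return grep { length } @words;
}

my @words = words_in("The quick\tbrown fox.\nIt jumps!");
print "$words[0] $words[1]\n";   # prints "the quick"
```

Applied to each line inside the while loop, this yields individual words rather than whole sentences.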

Anyway, even if I get that working, the search will probably take a very long time. BUT I have another idea: add up the ASCII codes of the letters in each word to get one total per word, and have the script match on the totals first. THEN compare only the words whose totals matched, letter by letter. That way it should speed things up... I hope.

How can I do this with Perl?

#13
September 14th, 2006, 02:25 PM
Ctb
That's not going to work. If you convert each character to its ASCII code and sum the codes, there's no guarantee that two different words won't have the same total. On the contrary, it's almost certain that a large number of different words will sum to the same value, which will make the search massively inaccurate.
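
To make the collision concrete, here is a small sketch. The helper name ascii_total is made up for illustration, and the word pairs are just examples.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Sum the character codes of a word (the proposed "ASCII total").
sub ascii_total {
    my ($word) = @_;
    return sum(map { ord($_) } split //, $word) || 0;
}

# "listen" and "silent" use the same letters, so their totals are equal
# even though the words differ; "ad" and "bc" collide as well (both 197).
printf "%s=%d %s=%d\n", 'listen', ascii_total('listen'), 'silent', ascii_total('silent');
printf "%s=%d %s=%d\n", 'ad', ascii_total('ad'), 'bc', ascii_total('bc');
```

Any anagram pair collides, and so do many unrelated words, so a match on the total alone says very little about whether the words are the same.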

#14
September 14th, 2006, 02:32 PM
sushi23
Quote:
Originally Posted by Ctb
That's not going to work. If you convert each character to its ASCII code and sum the codes, there's no guarantee that two different words won't have the same total. On the contrary, it's almost certain that a large number of different words will sum to the same value, which will make the search massively inaccurate.


But wouldn't it cut down the next search a lot?
I meant a first search on the ASCII totals, then a second search over only those results, done the normal way. OK, if not this... any ideas?
I'm still having trouble splitting the contents of the file into words.
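
One idea in that direction, sketched here as an assumption rather than anything proposed in this thread: a Perl hash already does the "cheap lookup first, exact match second" work internally, so the words can be indexed once and looked up directly. The names build_index and lookup and the sample file names are hypothetical.

```perl
use strict;
use warnings;

# Build an index: word => { file name => number of occurrences }.
# %$texts maps a file name to its already-extracted text.
sub build_index {
    my ($texts) = @_;
    my %index;
    for my $file (keys %$texts) {
        for my $word (split ' ', lc $texts->{$file}) {
            $word =~ s/^\W+|\W+$//g;      # trim surrounding punctuation
            next unless length $word;
            $index{$word}{$file}++;
        }
    }
    return \%index;
}

# Look a word up; returns the sorted names of the files that contain it.
# Call it in list context, e.g. my @files = lookup($index, 'perl');
sub lookup {
    my ($index, $word) = @_;
    return sort keys %{ $index->{lc $word} || {} };
}
```

Building the index costs one pass over all the text, after which each single-word query is a hash lookup instead of a scan of every file.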

Last edited by sushi23 : September 14th, 2006 at 02:39 PM.

#15
September 14th, 2006, 02:37 PM
Ctb

© 2003-2010 by Developer Shed. All rights reserved.