Search PDF's and return Match w/ PDF file link: a discussion in the Perl Programming forum on Dev Shed.
Posts: 165
Time spent in forums: 1 Day 12 h 21 m 16 sec
Reputation Power: 7
Search PDF's and return Match w/ PDF file link
I've been looking around and not too sure how to do this.
I have hundreds of PDF files that are inconsistent in format. I need to create a search engine that searches the full text of every PDF file and returns the matches along with a link to each matching PDF file.
Any particular modules I will need?
Actually, if possible, I'd rather do this with text files. I can convert all my PDFs to text files if it makes things easier. I know I have Word docs as well.
It is a lot of trouble installing new modules.
Thanks.
So far I have this which only reads a particular file and outputs it.
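use strict;
use warnings;

# Open the text file for reading, or report why it failed
open(my $fh, '<', 'textfile.txt') or die "Cannot open textfile.txt: $!";
while (my $line = <$fh>) {
    chomp $line;           # strip the trailing newline
    print "$line\n";
}
close($fh);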
Last edited by sushi23 : September 12th, 2006 at 06:40 PM.
Posts: 165
Time spent in forums: 1 Day 12 h 21 m 16 sec
Reputation Power: 7
I already have an interface that I built myself that searches against a database. I need it to read a directory of PDFs and return matching results. I am very much a beginner at this. Any sample code or direction would help.
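For anyone landing on this thread, here is a minimal sketch of the directory search, not a finished solution. It assumes the PDFs have already been converted to text files with the same base name (report.pdf becomes report.txt), and it uses only core Perl so no extra modules are needed. The sub name search_dir and the /pdfs/ link prefix are made up for illustration.

```perl
use strict;
use warnings;

# Search every .txt file in $dir for $term (case-insensitive, literal match).
# Returns a list of hashrefs: { pdf => ..., line => ..., text => ... }.
# Assumption: each PDF was converted to a text file with the same base name.
sub search_dir {
    my ($dir, $term) = @_;
    my @hits;

    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @files = sort grep { /\.txt\z/i } readdir($dh);
    closedir($dh);

    for my $file (@files) {
        open(my $fh, '<', "$dir/$file") or die "Cannot open $dir/$file: $!";
        (my $pdf = $file) =~ s/\.txt\z/.pdf/i;   # name of the original PDF
        while (my $line = <$fh>) {
            chomp $line;
            # \Q...\E treats the search term as plain text, not as a regex
            if ($line =~ /\Q$term\E/i) {
                push @hits, { pdf => $pdf, line => $., text => $line };
            }
        }
        close($fh);   # also resets the line counter $. for the next file
    }
    return @hits;
}

# Example run: perl search.pl converted_text "search term"
if (@ARGV == 2) {
    my ($dir, $term) = @ARGV;
    for my $hit (search_dir($dir, $term)) {
        # /pdfs/ is a hypothetical URL prefix for wherever the PDFs are served
        print qq{<a href="/pdfs/$hit->{pdf}">$hit->{pdf}</a> (line $hit->{line}): $hit->{text}\n};
    }
}
```

Each hit carries the PDF name, the line number in the text file, and the matching line, so an existing interface can format the link however it likes. If the output goes into a web page, the matched text should be HTML-escaped first.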
Posts: 165
Time spent in forums: 1 Day 12 h 21 m 16 sec
Reputation Power: 7
Hmm...now I'm thinking about reading the text file into an array and splitting it on spaces or tabs. I have tried both, but they give the same output: when I print element [0] or [1], I get an entire sentence. I've got to resolve this.
Anyway, even if I get that working, the search will probably take a very long time. BUT, I have another idea. I want to sum the ASCII codes of the letters in each word and have the script match on those totals first, THEN compare only the words whose totals matched, letter by letter. This way it would speed things up...I hope.
Posts: 4,425
Time spent in forums: 3 Weeks 10 h
Reputation Power: 0
That's not going to work. If you convert each character to its ASCII code and sum the codes, there's no guarantee that two different words won't have the same total. On the contrary, it's almost certain that a large number of different words will sum to the same value, which will make the search massively inaccurate.
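A quick way to see the collisions is to compute the totals for a few words. This is only an illustrative sketch; the sub name ascii_sum is made up here.

```perl
use strict;
use warnings;

# Sum of the character codes of a word (the proposed "ASCII total").
sub ascii_sum {
    my ($word) = @_;
    my $sum = 0;
    $sum += ord($_) for split //, $word;
    return $sum;
}

# Anagrams always share a total, and so do many unrelated words:
# dog and god both give 314, ad and bc both give 197.
for my $word (qw(dog god ad bc)) {
    printf "%-4s %d\n", $word, ascii_sum($word);
}
```

Note also that the totals still have to be computed for every word in every file, so the extra pass only saves time if the totals are stored ahead of the search.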
Posts: 165
Time spent in forums: 1 Day 12 h 21 m 16 sec
Reputation Power: 7
Quote:
Originally Posted by Ctb
That's not going to work. If you convert each character to its ASCII code and sum the codes, there's no guarantee that two different words won't have the same total. On the contrary, it's almost certain that a large number of different words will sum to the same value, which will make the search massively inaccurate.
But wouldn't it cut down the next search a lot?
I meant to search by ASCII total first, then run a second, normal search over only those results. OK, if not this...any ideas?
I'm still having trouble splitting the contents by words in the file.
Last edited by sushi23 : September 14th, 2006 at 02:39 PM.
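On the word-splitting trouble: the splitting code isn't shown, so this is a guess, but getting a whole sentence from element [0] usually means the array holds lines (one per element) and the split was never applied to each line. Below is a minimal sketch that splits each line into words and, as one possible answer to the speed question, builds a simple word-to-file lookup once so later searches are hash lookups. The names words_in and build_index and the sample file names are hypothetical.

```perl
use strict;
use warnings;

# Split one line into lowercase words. split ' ' (a single-space string)
# is Perl's special case: it skips leading whitespace and splits on any
# run of spaces or tabs.
sub words_in {
    my ($line) = @_;
    my @words;
    for my $token (split ' ', $line) {
        $token =~ s/^\W+|\W+$//g;              # trim punctuation at both ends
        push @words, lc $token if length $token;
    }
    return @words;
}

# Build a word => { file => count } index from file name => text pairs.
sub build_index {
    my (%text_of) = @_;
    my %index;
    for my $file (keys %text_of) {
        for my $line (split /\n/, $text_of{$file}) {
            $index{$_}{$file}++ for words_in($line);
        }
    }
    return \%index;
}

# Sample data standing in for two converted text files.
my $index = build_index(
    'a.txt' => "The quick brown fox.\n",
    'b.txt' => "A quick\ttab-separated line\n",
);

# Which files contain the word "quick"?
print join(', ', sort keys %{ $index->{quick} }), "\n";   # a.txt, b.txt
```

With the index built up front, finding the files for a word is a single hash lookup instead of a scan of every file, and unlike the ASCII totals it never confuses two different words.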