A Simple Search Engine in Perl

The Common Gateway Interface (CGI) standard was originally developed so that users could run programs on a server through the Web. The first CGI programs served as simple interfaces to standard commands such as grep and finger: they converted the output of those commands to HTML and passed the results to the user's browser.

CGI programs, and other programs run by the server, have become considerably more sophisticated since then. But one of their possible applications has never lost its relevance: searching the documents stored on a web site for a keyword or string. While search engines (now called portals) make it possible to search the whole Internet across a huge number of servers, CGI programs tackle a simplified version of the search problem. They search only the files of a single, local server and produce a list of URLs for the documents that match the user's query.

Let's look at ways to build several types of search programs. They cannot compete with ht://Dig and WebGlimpse, but they let us understand how such programs work and how they are written.



Simple search from the command line


My favorite language for writing such programs is Perl, mostly because of its text-processing capabilities. Perl makes it easy to find one piece of text inside another. For example, this one-line program prints every line of the file test.txt that contains the word "foo":



perl -ne 'print if m/foo/' test.txt


The -n switch tells Perl to loop over the input line by line without printing every line by default, and -e lets us supply the program itself between single quotes ('). We ask Perl to print every line in which the m// (match) operator finds the specified word. We can turn this task into a standalone program, as shown in Listing 1.



Listing 1. simple-search-1.pl

#!/usr/bin/perl -w
use strict;
use diagnostics;

# Open the file
my $filename = "test.txt";
open FILE, $filename
    or die qq{Cannot read "$filename": $!};

# Iterate through each line of the file,
# printing all lines that match
while (<FILE>)
{
    print if m/foo/;
}

close FILE;


Of course, this program searches for one fixed pattern (the string foo) inside one fixed file (test.txt). We can generalize it by using the empty diamond operator <> instead of reading from a single filehandle. The empty <> iterates over each element of @ARGV (the array of command-line arguments), treating each element as a file name and storing the name of the file currently being read in the variable $ARGV. If the command line contains no arguments, <> reads from standard input. Listing 2 contains a modified version of the program that searches for the string foo in several files. Note that this program prints the name of the file along with each matching line. Since $_ already ends with a newline character, we do not need to add one at the end of each print statement. Listing 2 can also be reduced to a single Perl command line:



perl -ne 'print "$ARGV: $_" if m/foo/;' *


Listing 2. simple-search-2.pl

#!/usr/bin/perl -w
use strict;
use diagnostics;

# Iterate through each line of each file
while (<>)
{
    # Print the matching filename and line
    print "$ARGV: $_" if m/foo/;
}


Now we can extend our program to let the user specify both the search pattern and the names of the files to search. The program in Listing 3 takes the first command-line argument, removes it from @ARGV and stores it in $pattern. To tell Perl that $pattern will not change, so that the pattern needs to be compiled only once, we use m// with the /o modifier.



Listing 3. simple-search-3.pl

#!/usr/bin/perl -w
use strict;
use diagnostics;

# Get the pattern
my $pattern = shift @ARGV;

# Iterate through each line of each file
while (<>)
{
    # Print the matching filename and line
    print "$ARGV: $_" if m/$pattern/o;
}


Now, to search for the pattern "f.[aeiou]" in all files with the .txt extension, we run:



./simple-search-3.pl "f.[aeiou]" *.txt


Sure enough, every line that contains the letter f, followed by any single character, followed by one of the vowels listed in the square brackets, is now printed on the screen along with the name of its file.
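
The behaviour of this pattern is easy to check on a few strings held in memory. The sketch below is only an illustration: the subroutine name matching_lines and the sample lines are made up for this example, and it compiles the pattern once with qr// where Listing 3 relies on the /o modifier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the listings): return the lines
# that match a pattern supplied as a string.
sub matching_lines
{
    my ($pattern, @lines) = @_;

    # Compile the pattern once, instead of relying on /o
    my $regex = qr/$pattern/;

    return grep { $_ =~ $regex } @lines;
}

# Made-up sample data: the first two lines match, the third does not
my @sample = ("a fool and his money\n",
              "fee fie foe\n",
              "nothing to see here\n");

print matching_lines("f.[aeiou]", @sample);
```

Keeping the matching logic in a subroutine also makes it possible to try a pattern without creating any files first.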


File::Find


The program above could form a good basis for a web search if all the documents on a web site were stored in a single directory. In real life, though, most web sites are a fairly deep hierarchy of subdirectories filled with files. A good search program must walk the whole hierarchy of the site and look at each file in each subdirectory.

We could try to do this ourselves, but someone has already done the work for us. The File::Find module, which is part of the standard Perl distribution, makes it possible to write programs similar to find in Perl. File::Find exports the subroutine find, which takes a list of arguments. The first argument is a reference to a subroutine that is called for each file found. The remaining arguments are the names of directories and files, which File::Find traverses in order until it reaches the last one. For example, Listing 4 shows a program that uses File::Find to print the list of files stored under a particular directory. As you can see, File::Find provides the variable $File::Find::name, which contains the full path of the current file. The name of the current directory is in $File::Find::dir.



Listing 4. simple-find.pl

#!/usr/bin/perl -w
use strict;
use diagnostics;
use File::Find;

# Invoke "find" with a reference to our
# subroutine, and the initial directory name
find(\&print_name, "/home/reuven");

sub print_name { print "$File::Find::name\n"; }

Listing 5. simple-find-2.pl

#!/usr/bin/perl -w
use strict;
use diagnostics;
use File::Find;

# Get the pattern from the input list
my $pattern = shift @ARGV;

# Slurp up the entire contents of a file
$/ = undef;

print qq{Searching for "$pattern".\n};

# Invoke "find" with a reference to our subroutine,
# with the directories passed as arguments
find(\&find_matches, @ARGV);

sub find_matches
{
    my $filename = $_;

    # Open the file, and search through its
    # contents
    if (open FILE, $filename)
    {
        # Get the file
        my $contents = (<FILE>);

        # If there are not any contents, then return
        # right away
        return unless $contents;

        # Print the filename, with the directory
        print qq{$File::Find::dir/$filename\n}
            if ($contents =~ m|\b$pattern\b|is);

        close FILE;
    }
    else
    {
        warn qq{Unable to open "$File::Find::dir/$filename": $!};
        return;
    }
}


Listing 5 shows simple-find-2.pl, a program that uses File::Find to search the files located anywhere beneath a given set of directories. As in other programs that use File::Find, the main work of simple-find-2.pl is done by find_matches, the subroutine called to process each file found in the directories listed in @ARGV. To find all files containing the pattern "f.[aeiou]" in the directories /home and /development, we type:



./simple-find-2.pl "f.[aeiou]" /home /development


The statement in simple-find-2.pl that assigns undef to $/ deserves special mention: it redefines $/, the variable that defines the end-of-line character (the input record separator). Normally, the <> operator reads a file line by line, returning undef when it reaches the end of the file. But we want to search the whole file, even if the word or phrase we are looking for begins on one line and ends on another. Once $/ has been redefined, the line

my $contents = (<FILE>);

assigns to the variable $contents the entire contents of the file read through FILE, and not just one line.
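
Because $/ is a global variable, setting it at the top of the program changes how every later read behaves. A minimal sketch of a more contained variant follows, assuming a hypothetical helper named slurp_file (not part of the listings) that uses local to undefine $/ only for the duration of one read.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string.
sub slurp_file
{
    my ($filename) = @_;

    open my $fh, '<', $filename
        or die qq{Cannot read "$filename": $!};

    # Undefine the input record separator only inside this subroutine;
    # its previous value is restored when the subroutine returns
    local $/ = undef;
    my $contents = <$fh>;

    close $fh;
    return $contents;
}
```

A phrase that spans a line break can then be found by applying a pattern such as m/foo\s+bar/ to the returned string.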


Searching through the Web


Now that we can search for any pattern inside a particular directory, let's bring this task to the Web and search all the documents of an HTTP server. Such a program needs to get only the search pattern from the user, since the root of a web site does not change very often.



Listing 6. Search form

<html>
<head><title>Search form</title></head>
<body>
<form method="post" action="/cgi-bin/simple-cgi-find.pl">
<p>Pattern: <input type="text" name="pattern"></p>
<input type="submit" value="Start search">
</form>
</body>
</html>


Listing 6 shows an HTML form that can be used for this kind of input. It submits its contents to simple-cgi-find.pl, the CGI program in Listing 7. simple-cgi-find.pl takes the search pattern from its pattern parameter, compares it against each file in the hierarchy of the web site, and returns a list of the matching documents.



Listing 7. simple-cgi-find.pl

#!/usr/bin/perl -w
use strict;
use diagnostics;
use File::Find;
use CGI;
use CGI::Carp qw|fatalsToBrowser|;

# Which directory should start the search?
my $search_root = "/usr/local/apache/htdocs";

# Slurp up files in one fell swoop
undef $/;

# Create an instance of CGI
my $query = new CGI;

# Send a MIME header
print $query->header("text/html");

# Get the text pattern for which to search
my $pattern = $query->param("pattern");

# Make sure that $pattern is defined
unless ($pattern)
{
    print $query->start_html(-title => "No pattern named");
    print "<p>You must enter a pattern!</p>";
    print $query->end_html;
    exit;
}

# Start the HTML output
print $query->start_html(-title => "Search results");
print qq{<p>The following documents matched the
pattern "$pattern":</p>\n};

# Start an unordered list
print "<ul>\n";

# Search for matches
find(\&find_matches, $search_root);

# End an unordered list
print "</ul>\n";
print $query->end_html;

# -----------------------------------------
# Subroutine that searches through files for
# matches
sub find_matches
{
    # Make sure that this is an HTML file
    return unless m/\.html?$/i;

    # Get the filename
    my $filename = $_;

    # Open the file, and search through its
    # contents
    if (open FILE, $filename)
    {
        # Get the file
        my $contents = (<FILE>);

        # Print the filename, with the directory
        print qq{<li>$File::Find::dir/$filename\n}
            if ($contents =~ m|\b$pattern\b|is);

        close FILE;
    }
    else
    {
        warn qq{Unable to open "$filename": $!};
        return;
    }
}


Unfortunately, the version of File::Find that ships with Perl does not support the -T flag, which turns on taint mode, Perl's safety mechanism for untrusted data. CGI programs should always be run with this flag, so that data received from external sources cannot become a threat to the security of the system. In our case we cannot use this mode: File::Find calls the fastcwd subroutine of the Cwd module, which does not work properly under -T. For now I suggest running these programs without -T, but when the next version of Perl is released I strongly recommend upgrading, so that your CGI programs can run in taint mode.

Our search subroutine find_matches needs to be changed a little for the convenience of web users. First of all, we will search only HTML and text files; there is no point in examining every graphics file:



return unless (m/\.html?$/i or m/\.te?xt$/i);


On some web sites, hypertext documents have the extension .htm (or .HTM), and text files correspondingly have .txt or .TXT instead of .text. The pattern above matches all of these variants: the /i modifier makes it ignore case, and the $ metacharacter makes it consider only extensions (the tails of file names).

Once it has read the contents of the current file, find_matches checks for $pattern inside the variable $contents, which holds the contents of the document. We have surrounded $pattern with \b so that it is matched only at word boundaries. A search for foo will now not match the word food, even though that word contains our pattern.
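
The effect of the word-boundary anchors can be seen on two short strings. This is a minimal sketch with a hypothetical helper named contains_word; unlike the listings, it wraps the word in \Q...\E so that any regex metacharacters in it are matched literally, which is an assumption made for this example only.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $word occurs in $text as a whole word.
# \Q...\E quotes regex metacharacters in $word; that is an addition
# for this sketch, the listings interpolate the pattern unquoted.
sub contains_word
{
    my ($text, $word) = @_;
    return $text =~ m/\b\Q$word\E\b/is ? 1 : 0;
}

print contains_word("Bring some food", "foo"), "\n";   # "foo" only inside "food"
print contains_word("The foo bar",     "foo"), "\n";   # "foo" as a word of its own
```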

If a match is found, find_matches builds a URL by replacing $search_root with $url_origin, which hides the true storage hierarchy of the HTML documents from outside users. It then prints the name of the file together with a hyperlink to this URL:



if ($contents =~ m|\b$pattern\b|ios)
{
    my $url = "$File::Find::dir/$filename";
    $url =~ s/$search_root/$url_origin/;
    print qq{<li><a href="$url">$filename</a>\n};
}
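
The path-to-URL rewriting can be tried on its own. The following is a sketch under stated assumptions rather than the code from the program: path_to_url is a hypothetical helper, www.example.com is a placeholder host, and the substitution is anchored with ^ and quoted with \Q...\E so that only a literal leading copy of the root is replaced.

```perl
use strict;
use warnings;

# Hypothetical helper: map a file path under $search_root to a URL
# under $url_origin. Both values are passed in as assumptions.
sub path_to_url
{
    my ($path, $search_root, $url_origin) = @_;

    # Anchor the match at the start of the path, and quote the root
    # with \Q...\E so that characters such as "." are taken literally
    (my $url = $path) =~ s/^\Q$search_root\E/$url_origin/;

    return $url;
}

print path_to_url("/usr/local/apache/htdocs/docs/index.html",
                  "/usr/local/apache/htdocs",
                  "http://www.example.com"), "\n";
```

A path that does not start with the search root is returned unchanged.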


Improving our web search


Although our simple-cgi-find.pl program already works, it has some shortcomings. To begin with, it does not distinguish HTML tags from the content of a page. A search for img should not return every document that contains an <img> tag; it should return only those documents that contain the word outside of HTML tags. To achieve this, we will make our program strip the HTML tags from the contents of each file.

Beginning programmers often think that the best way to get rid of HTML tags is to remove everything between < and >, for example:



$contents =~ s|<.+>||g;


Since the dot "." in Perl matches any character, and the plus "+" means one or more of the preceding item, this command should in theory remove all tags. Unfortunately, that is not what happens: the command removes everything between the first "<" and the last ">" that the match can span, which is a whole line, or the whole file when the /s modifier lets the dot match newlines. This happens because Perl's quantifiers are "greedy" by default and try to match as many characters as possible.

We can make "+" less greedy, so that it matches as few characters as possible, by adding a "?" after it:



$contents =~ s|<.+?>||g;
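
The difference between the two substitutions is easiest to see side by side. In this small sketch the helper names strip_greedy and strip_non_greedy and the sample string are invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers for comparing the two substitutions
sub strip_greedy
{
    my ($html) = @_;
    $html =~ s|<.+>||g;     # greedy: runs to the last ">"
    return $html;
}

sub strip_non_greedy
{
    my ($html) = @_;
    $html =~ s|<.+?>||g;    # non-greedy: stops at the first ">"
    return $html;
}

my $sample = "<b>bold</b> and <i>italic</i>";   # made-up input

print "greedy:     '", strip_greedy($sample), "'\n";
print "non-greedy: '", strip_non_greedy($sample), "'\n";
```

On this input the greedy version deletes the whole string, while the non-greedy version leaves the text between the tags.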


But there is one more weak spot: the case where $pattern contains spaces. Should a search treat a string containing spaces as a single phrase? Or should we split it into words and combine them with the logical operators "or" or "and"?



Listing 8. Form with radio buttons

<html>
<head><title>Search form</title></head>
<body>
<form method="post" action="/cgi-bin/better-cgi-search.pl">
<p>Search string: <input type="text" name="pattern"></p>
<p>
<input type="radio" name="type" value="or"> At least one word
<input type="radio" name="type" value="and"> All words
<input type="radio" name="type" checked value="phrase"> Exact phrase
</p>
<p><input type="submit" value="Search!"></p>
</form>
</body>
</html>


In this particular case we can have our cake and eat it too. By adding a set of radio buttons to the HTML form, we can let the user choose between matching the exact phrase and matching on the individual words it contains.

Now we can improve our program by adding the ability to search for the whole phrase (as we have done until now), to search with "and" logic (all of the words in the phrase must be found) and to search with "or" logic (at least one word from the phrase must be found).

To implement the "and" search, we split the phrase into its words using Perl's split operator. We then count the number of words we need to find and check $contents for each of them, decrementing the counter for every word found. If the counter $count reaches zero, all of the words were found:



elsif ($search_type eq "and")
{
    my @words = split /\s+/, $pattern;
    my $count = scalar @words;

    foreach my $word (@words)
    {
        $count-- if ($contents =~ m|\b$word\b|is);
    }

    unless ($count)
    {
        print qq{<li><a href="$url">$filename</a>\n};
        $total_matches++;
    }
}


The "or" search is even easier to implement: we again split $pattern into words on whitespace. As soon as one of the resulting words is found, we can print the name of the file and its hyperlink and return from find_matches:



elsif ($search_type eq "or")
{
    my @words = split /\s+/, $pattern;

    foreach my $word (@words)
    {
        if ($contents =~ m|\b$word\b|is)
        {
            print qq{<li><a href="$url">$filename</a>\n};
            $total_matches++;
            return;
        }
    }
}
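
The three kinds of search can also be gathered into a single subroutine that only answers yes or no, which makes the logic easy to test apart from the CGI output. This is a sketch, not the code of better-cgi-search.pl: the name document_matches is hypothetical, and the parameter values "phrase", "and" and "or" are assumed to correspond to the radio buttons of Listing 8.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether $contents matches $pattern
# under the given search type ("phrase", "and" or "or").
sub document_matches
{
    my ($contents, $pattern, $search_type) = @_;

    # Exact phrase: the pattern is matched as a whole
    if ($search_type eq "phrase")
    {
        return $contents =~ m|\b$pattern\b|is ? 1 : 0;
    }

    # Split into words, dropping any empty leading field
    my @words = grep { length } split /\s+/, $pattern;

    # Count how many of the words occur in the contents
    my $found = 0;
    foreach my $word (@words)
    {
        $found++ if $contents =~ m|\b$word\b|is;
    }

    # "and": every word must be found; "or": at least one word
    if ($search_type eq "and") { return (@words && $found == @words) ? 1 : 0; }
    if ($search_type eq "or")  { return $found ? 1 : 0; }

    return 0;    # unknown search type
}
```

find_matches could then print the link and increment its counter whenever such a subroutine returns a true value.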


Finally, we should tell the user how many documents were found. We do this by creating a new variable, $total_matches, which is incremented by one on each successful match (as can be seen in the "and" and "or" fragments above). The program that includes all of these changes is called better-cgi-search.pl (Listing 9 is not reproduced in this article, but it is available in the archive containing all of the program listings, at ftp.ssc.com/pub/lj/listings/issue69/3753.tgz).


Excluding directories and files


We now have a finished search program that can perform every kind of search we could wish for. The problem is that our program may turn out to be too good and too useful. Many people put information on the Web without wanting it to be immediately available to everyone: they simply do not create links to certain directories and documents, and do not want to give general access to them. But our program does not depend on hyperlinks in any way when it searches.

The simplest solution is to have the program skip any directory that contains a file named .nosearch. This file need not contain anything, since its presence alone means that the directory is excluded from the search.

Checking for a .nosearch file in each directory we examine is simple. But performing that check on every call to find_matches would noticeably slow the program down. It would be better if the program, once it has discovered a .nosearch file, remembered that information about the directory and used it from then on.


Another problem


We can solve these problems by adding two statements to the program. The first, at the beginning of find_matches, returns immediately if the current directory is already known to contain a .nosearch file:



return if ($ignore_directory{$File::Find::dir});


If we get to the next statement, no .nosearch file has been recorded for this directory yet. But a .nosearch file can turn up in several ways: when the file we are examining is .nosearch itself, when .nosearch exists in the current directory, or when it exists in the parent directory, one level up. If a directory is excluded from the search, all of the subdirectories beneath it must be excluded as well. The following statement does this job:



# Mark the directory as ignorable...
$ignore_directory{$File::Find::dir} = 1
    if (($_ eq ".nosearch") ||
        (-e ".nosearch") ||
        (-e "../.nosearch"));
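
The test for "../.nosearch" in the fragment above looks only one level up. As an alternative sketch (not the approach used in better-cgi-search.pl), the preprocess option of File::Find can prune an excluded directory before any of its entries, including its subdirectories, are visited. The subroutine name searchable_files is hypothetical, and the sketch assumes that the root directory is given as an absolute path.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: list the plain files under $root, skipping any
# directory that contains a .nosearch file, and everything beneath it.
sub searchable_files
{
    my ($root) = @_;
    my @files;

    find({
        # Called once per directory with the names it contains;
        # returning an empty list means nothing in it is visited
        preprocess => sub {
            return (-e "$File::Find::dir/.nosearch") ? () : @_;
        },

        # Called for each remaining entry
        wanted => sub {
            push @files, $File::Find::name if -f $_;
        },
    }, $root);

    return sort @files;
}
```

Because an excluded directory is never entered, no hash of ignored directories is needed in this variant.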


The version of better-cgi-search.pl with these additions is in Listing 10 (which can be found in the archive mentioned above).


Are there ways to speed up the search?


If you have already run these programs, you have most likely run into the problem mentioned earlier: they are very slow. If your web site consists of a hundred files, everything works fine. But once your site has grown to 1,000 or 10,000 files, users will give up on a search before its results arrive, because it takes so long.


For this reason, most serious search engines use a different strategy that splits the search into two stages. In the first stage, an indexing program goes through all the files and saves information about them in an index file. Then a second program, started by the search client, looks through the previously generated index file for matches.
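
To make the two-stage idea concrete, here is a minimal in-memory sketch, assuming a very simple index that maps each lowercased word to the documents containing it. The subroutine names build_index and lookup, and the sample documents, are invented for this illustration; a real indexer would write its index to a file and handle far more cases.

```perl
use strict;
use warnings;

# Stage 1 (hypothetical): map each word to the set of documents
# in which it appears. %$documents maps a name to its text.
sub build_index
{
    my ($documents) = @_;
    my %index;

    foreach my $name (keys %$documents)
    {
        foreach my $word ($documents->{$name} =~ m/(\w+)/g)
        {
            $index{lc($word)}{$name} = 1;
        }
    }

    return \%index;
}

# Stage 2 (hypothetical): answer a one-word query from the index
# alone, without reading any document again.
sub lookup
{
    my ($index, $word) = @_;
    return sort keys %{ $index->{lc($word)} || {} };
}

# Made-up documents
my $index = build_index({
    "a.html" => "Perl makes searching easy",
    "b.html" => "Searching a large site is slow",
});

print join(", ", lookup($index, "searching")), "\n";
```

The cost of reading every file is paid once, when the index is built, and each query afterwards is a single hash lookup.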


In the next article I will show how to create such indexes and how to search them. Our simple search program may not be able to compete with Glimpse and ht://Dig, but at least we have learned how such programs work and how they are written.