Web Spider Creation In Perl, Visual Basic, And Java
Have you ever visited a thumbnail gallery post that automatically checked your gallery submission? Richard’s Realm and The Hun are just two heavyweights utilizing applied Web spider technology to increase productivity. Thumbnail gallery recognition is an excellent chore for automated Web robots capable of counting links, screening text for banned keywords, and declining galleries with JavaScript popup windows. Google, AltaVista, and FAST Web spiders scour the Internet constantly, refreshing massive databases with new caches as often as every month! The data harvested by these turbocharged Web spiders are providing the centralized memory bank needed for artificial intelligence pattern matching agents.
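As a rough illustration of the three checks just described, the following Java sketch counts links, screens for banned keywords, and flags JavaScript popups in a page that has already been downloaded. The class name GalleryScreener, the link limit, and the banned-word list are hypothetical, and regular expressions only approximate real HTML parsing; the spiders run by the sites named above may work quite differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of an automated gallery check; names and limits are illustrative.
class GalleryScreener {
    // Matches the start of an anchor tag that carries an href attribute.
    private static final Pattern LINK =
            Pattern.compile("<a\\s[^>]*href\\s*=", Pattern.CASE_INSENSITIVE);
    // Crude popup test: any call to window.open( anywhere in the page source.
    private static final Pattern POPUP =
            Pattern.compile("window\\.open\\s*\\(", Pattern.CASE_INSENSITIVE);

    // Counts anchor tags with an href attribute.
    static int countLinks(String html) {
        Matcher m = LINK.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Case-insensitive substring search for any banned word.
    static boolean containsBannedWord(String html, List<String> banned) {
        String lower = html.toLowerCase(Locale.ROOT);
        for (String word : banned) {
            if (lower.contains(word.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    static boolean hasPopup(String html) {
        return POPUP.matcher(html).find();
    }

    // A gallery is accepted only if it passes all three checks.
    static boolean accept(String html, int maxLinks, List<String> banned) {
        return countLinks(html) <= maxLinks
                && !containsBannedWord(html, banned)
                && !hasPopup(html);
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"1.jpg\">one</a> "
                + "<A HREF='2.jpg'>two</A></body></html>";
        List<String> banned = Arrays.asList("casino", "warez");
        System.out.println("links: " + countLinks(page));
        System.out.println("accepted: " + accept(page, 20, banned));
    }
}
```

A real submission checker would first download the gallery page and would probably use a proper HTML parser, but the accept-or-decline decision can stay this simple.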
Q: So how can you, the adult Webmaster, harness the power of the Internet’s most powerful resource gatherers?
A: By leveraging the power of increasingly valuable databases through Web spider application.
Have you ever spent hours harvesting information – images, links, e-mail addresses, anything – from one or more Websites? How about checking each link on your oft-changing home page by hand? Did you feel as though the almost mindless repetition could be performed by an automated machine? Chances are, the task could be done faster and easier by a computer freed from the hassles of human interaction, graphical user interfaces (GUIs), and token bloated code. But which programming languages are the best for custom Web spider creation? I personally prefer Perl, especially when easy CGI compatibility and stable UNIX nativity are considered. However, Visual Basic’s ease-of-use and Java’s awesome threading capabilities provide viable solutions as well.
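To make the harvesting idea concrete, here is a minimal Java sketch that pulls links and e-mail addresses out of a page's HTML. The class name LinkHarvester and both regular expressions are assumptions made for illustration; the e-mail pattern in particular is deliberately simple and will miss unusual addresses.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: harvest links and e-mail addresses from downloaded HTML.
class LinkHarvester {
    // Captures the value of a quoted href attribute (single or double quotes).
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);
    // Deliberately simple e-mail pattern; real addresses can be more exotic.
    private static final Pattern EMAIL = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Returns absolute http/https links, resolving relative hrefs against baseUrl.
    static Set<String> extractLinks(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String target = m.group(2).trim();
            if (target.isEmpty() || target.startsWith("#")) {
                continue; // skip empty hrefs and same-page anchors
            }
            try {
                URI resolved = base.resolve(target);
                String scheme = resolved.getScheme();
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed href: ignore it rather than abort the whole harvest.
            }
        }
        return links;
    }

    // Returns every distinct e-mail-like string found in the text.
    static Set<String> extractEmails(String html) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(html);
        while (m.find()) {
            emails.add(m.group());
        }
        return emails;
    }

    public static void main(String[] args) {
        String html = "<a href=\"pics/1.html\">1</a> "
                + "<a href='mailto:webmaster@example.com'>mail</a>";
        System.out.println(extractLinks("http://www.example.com/galleries/index.html", html));
        System.out.println(extractEmails(html));
    }
}
```

Feeding each harvested link back into a downloader and recording which ones fail is all it takes to turn this into the home-page link checker mentioned above.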
Perl
Perl – the “Practical Extraction and Report Language” – is one of the most popular Web programming languages thanks to its truly powerful string manipulation functions and robust error handling.
Perl is a broadly capable scripting language that is commonly used for easy manipulation of files, logs, and user profiles. The functions that Perl offers have been optimized for scanning text files, extracting information from files, printing reports, and managing systems, without placing any restrictions on the size of the data that the code is working with. Perl can scan large amounts of data very quickly by making use of sophisticated pattern matching techniques. On the Web, a Perl script typically runs as an external program that the Web server launches in real time through the Common Gateway Interface.
Native to the UNIX operating environment and the most common Web programming language for Common Gateway Interface (CGI) scripts, Perl suffers from slower execution speeds due to on-the-fly code interpretation. Executable (.exe) files generated by Visual Basic and C programming compilers are mostly illegible when viewed by humans, while Perl programs are simple text documents that must be parsed and interpreted each time they run. However, this performance loss hampers only the most heavily-trafficked or poorly equipped servers, and the wide availability of public Perl code and newsgroup support often counterbalances any performance issues.
The standard example of LWP client usage looks like this:
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    my $user_agent = LWP::UserAgent->new;
    my $request = HTTP::Request->new(GET => 'http://www.ynotmasters.com/');
    $request->header(Accept => 'text/html');

    # send request to server and get response back
    my $response = $user_agent->request($request);

    # check response outcome
    if ($response->is_success) {
        print "Success: " . $response->content . "\n";
    } else {
        print "Failure: " . $response->status_line . "\n";
    }
Visual Basic
VB is now the number one programming language for Windows application development. An easy-to-learn, easy-to-use language, Visual Basic is here today thanks to Microsoft and, more specifically, Alan Cooper. In 1988, Cooper invented Ruby, the graphical precursor to Visual Basic. VB is a very rich and mature server-side programming language that allows for the customization of client applications. VBA, Visual Basic for Applications, extends the functionality of applications such as Microsoft Word, Access, and Excel by providing the Visual Basic tools necessary to create and customize programs in a variety of environments. Imagine a custom Microsoft Word VBA program that checks the Web for sources automatically, and then writes a research paper on the fly. It might sound like a fantasy, but the concept illustrates the power that Web-enabled applications can have.
Visual Basic Web spider creation is handled by the Internet Transfer Control (ITC), an easy-to-use ActiveX control that sits invisibly on a form. ITC is broadly functional; however, the programmer must handle errors gracefully to prevent frequent program crashes when accessing varying data sources.
The following VB subroutine assumes that an Internet Transfer Control object is already in place in an existing project:
    Private Sub YourButtonHere_Click()
        Dim txt As String
        Dim b() As Byte
        Dim i As Long

        txt = ""

        ' This opens the file specified in the URL text box (1 = icByteArray)
        b() = Inet1.OpenURL(URL.Text, 1)

        ' UBound(b) is the last valid index, so the loop includes it
        For i = 0 To UBound(b)
            txt = txt + Chr(b(i))
        Next

        ' This loads the opened file into an existing RichTextBox control
        RichTextBox1.Text = txt
    End Sub
Java
Java is an object-oriented programming language that provides a robust and dynamic programming environment. One of Java’s major advantages is that it is platform-independent, meaning that it is capable of running on a wide variety of different operating systems. Java can also be multi-threaded, which makes it an ideal language for developing applications for the Internet, intranets, and any other type of complex, distributed network environment. The Xerces and SAX parsers make handling HTML and XML a snap.
Rather than provide simple code for a mostly worthless robot, I chose to include a link to JoBo, a sample Java Web user agent. Both the download and the source code are available here. The demo is highly functional, and the sample code can keep any interested person involved for hours.
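For readers who want a feel for the threading advantage before digging into JoBo, the sketch below downloads several pages at once with a fixed-size thread pool. It is unrelated to JoBo's own code: the class name ParallelFetcher, the five-second timeouts, the pool size, and the "ERROR:" convention are all assumptions chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: fetch many URLs in parallel without letting one failure stop the rest.
class ParallelFetcher {
    // Downloads one page; returns an "ERROR: ..." string instead of throwing.
    static String httpGet(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            conn.setRequestProperty("Accept", "text/html");
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                // Assumes UTF-8; a real spider would honor the response charset.
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        } catch (Exception e) {
            return "ERROR: " + e;
        }
    }

    // Runs fetcher on every URL using a pool of worker threads; results keep input order.
    static Map<String, String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                pending.add(pool.submit(task));
            }
            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < urls.size(); i++) {
                try {
                    results.put(urls.get(i), pending.get(i).get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    results.put(urls.get(i), "ERROR: interrupted");
                } catch (ExecutionException e) {
                    results.put(urls.get(i), "ERROR: " + e.getCause());
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Pass the URLs to fetch as command-line arguments.
        Map<String, String> pages = fetchAll(Arrays.asList(args), ParallelFetcher::httpGet, 4);
        for (Map.Entry<String, String> e : pages.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().length() + " characters");
        }
    }
}
```

Passing the download step in as a function keeps the threading logic separate from the network code, so the pool can be exercised with a stand-in fetcher before it is pointed at live sites.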
Conclusion
“The popularity of the Internet has created a new challenge… In addition to being able to use files and documents stored on a local server, today’s applications are expected to extend their reach globally. A program should provide the flexibility of being able to retrieve a document from the local hard drive, other accessible systems on a network, or any computer connected to the Internet.” (Thayer 516).1
Making your Website interact directly with the Web (or having spiders harvest data into a Web-enabled database) is the easiest way to provide your surfers with a truly dynamic product. The importance of server user agents is set to grow dramatically. After all, how else will our refrigerators know when they’re running low on milk?
1 Thayer, Rob, et al. Visual Basic 6 Unleashed. Sams Publishing, 1999.
David Wolf has worked in the adult industry for three years. He specializes in international marketing strategy and web spider creation. David holds a BS in Business Information Systems, is currently pursuing his MS in Strategic Intelligence, and can be reached for follow-up inquiries at Wolf@adultwebmasterconsultants.com or through www.adultwebmasterconsultants.com.