BIO 256 - Computer Applications in Biology


Assignment 4 - Finding Information on the World Wide Web


The fastest growing segment of the Internet is the World Wide Web (WWW, or simply "the Web"). The Web works on a client/server model. People can install a Web server on a computer connected to the Internet (Cal Poly Pomona has a number of Web servers), and then they can set up "pages" that contain text, pictures and graphics, sounds, and even movies and small computer programs. People on the Internet who have client software (better known as "browsers") can visit these sites and view their contents. A site may allow visitors to interact in some way, e.g., by signing a guest book or downloading files. A site can also allow visitors to fill out forms; e.g., Cal Poly now allows you to apply for admission by filling out a form on a Web page - that way you don't have to mail it in. Of course, you still have to mail in your check, unless the site allows you to enter a credit card number.

In addition to its ability to present multimedia, another popular feature of the Web is the ability to move from place to place very easily. The person who sets up a page can provide "links" (also called "hyperlinks") to other pages and sites that visitors may want to see.

The most popular Web browser programs are Netscape Navigator/Communicator and Microsoft Internet Explorer. These are full-featured programs that can handle text and graphics. They can also deal with sounds and movies, although they often need help from other programs to do this. Transferring and displaying pictures, graphics, sounds, and movies takes an enormous amount of computer resources, and not every machine can handle this. There are both older (Mosaic) and newer (Opera) browsers that are neither as large nor as resource-hungry as Netscape or Internet Explorer. There are also browsers that display text only - the most common one is called Lynx, and it is available through the Unix shell of the Intranet for those of you who are especially adventurous.

Web servers and browsers work with files written in Hypertext Markup Language (HTML). HTML files consist of ordinary text, but contain special formatting commands called tags. For example, the text <B>BIO 256 is hard work</B> causes "BIO 256 is hard work" to appear in boldface. The <B> and </B> turn boldface on and off, respectively. The text <A HREF="http://www.csupomona.edu">Cal Poly Pomona</A> makes the "Cal Poly Pomona" part a hyperlink to our campus home page. The http://www.csupomona.edu is called a Uniform Resource Locator, or URL. You won't have to worry about HTML for your assignment, but it may help to know what these files are in case you see one. If you someday decide to set up your own page on the Web, you might want to become familiar with HTML.
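If you are curious how a program "sees" the tags described above, here is a minimal sketch in Python (not part of the assignment). It uses the standard library's html.parser module to pull the boldface text and the hyperlink out of the two example fragments. The class name TagReporter and its attribute names are made up for this illustration.

```python
from html.parser import HTMLParser


class TagReporter(HTMLParser):
    """Collect boldface text and hyperlinks from a fragment of HTML.

    TagReporter is a hypothetical name chosen for this illustration.
    """

    def __init__(self):
        super().__init__()
        self.in_bold = False   # True while we are between <B> and </B>
        self.href = None       # URL of the <A> tag we are inside, if any
        self.bold_text = []    # text that would appear in boldface
        self.links = []        # (URL, link text) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser converts tag and attribute names to lower case.
        if tag == "b":
            self.in_bold = True
        elif tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False
        elif tag == "a":
            self.href = None

    def handle_data(self, data):
        # Called for the ordinary text between tags.
        if self.in_bold:
            self.bold_text.append(data)
        if self.href is not None:
            self.links.append((self.href, data))


page = ('<B>BIO 256 is hard work</B> '
        '<A HREF="http://www.csupomona.edu">Cal Poly Pomona</A>')
reporter = TagReporter()
reporter.feed(page)
reporter.close()

print(reporter.bold_text)  # ['BIO 256 is hard work']
print(reporter.links)      # [('http://www.csupomona.edu', 'Cal Poly Pomona')]
```

A browser does much more than this, of course; the sketch only shows that the tags are ordinary text that a program can recognize and act on.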

There is so much material on the World Wide Web that a large number of tools have been developed to help you find things. There are two basic approaches to locating information. The first is often called a "subject directory" and the second a "keyword index".

A subject directory is put together by people, who evaluate each web page and categorize it by subject. There may be a whole team of people involved, as at Yahoo!, or a single person, such as Dr. Steve Wolf at CSUBioWeb.

A keyword index is a searchable list of words that occur in web pages. It is assembled by a computer program called a webcrawler or robot, which goes from web page to web page, indexing them. Using a keyword index requires some skill on your part to formulate the search, but the index includes far more pages than a subject directory.
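To make the idea of a keyword index concrete, here is a toy sketch in Python (again, not part of the assignment). The three "pages" and their example.edu addresses are invented for this illustration, and a real robot would fetch pages from the Web rather than read them from a list. The function names build_index and search_all are also hypothetical.

```python
import re
from collections import defaultdict

# Invented pages standing in for what a robot might have visited.
PAGES = {
    "http://example.edu/a": "Gene therapy trials for cystic fibrosis",
    "http://example.edu/b": "Gene expression in yeast",
    "http://example.edu/c": "Physical therapy after surgery",
}


def build_index(pages):
    """Map each lower-case word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search_all(index, query):
    """Return the set of pages that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


index = build_index(PAGES)
print(sorted(search_all(index, "gene")))
# ['http://example.edu/a', 'http://example.edu/b']
print(sorted(search_all(index, "gene therapy")))
# ['http://example.edu/a']
```

Notice that nobody judged whether any of these pages is good or relevant; the index only records which words occur where.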

The chart below will help you to compare them:

| Feature | Subject Directory | Keyword Index |
| --- | --- | --- |
| Creator | People, who choose to include links to web pages based on their own, sometimes expert, judgment of content. | Computer programs, often called "robots", that go from page to page making an index of all the words. |
| Arrangement | By subject; often hierarchic. | Random access. |
| Ease of finding information if you aren't familiar with the subject | Moderate to high, depending on how good a job the creators do. | Low. |
| Likelihood of including every relevant page | Low. | Relatively high, if you have chosen your search terms well (although recent studies show that less than half of all web pages are indexed by any given search engine, even the best ones). |
| Likelihood of every page being relevant | Moderate to high, depending on how good a job the creators do. | Low to moderate; even with good search terms, there will always be irrelevant pages. |
| Quality of the pages recovered | Moderate to high, depending on how good a job the creators do. | Low to high; there is often no quality control at all. |


Assignment

Now for our assignment. The first thing you have to do is find a browser. Virtually every Mac and Windows computer on campus has Netscape and every Windows computer has Internet Explorer. Since you don't need a printer for this assignment, even the ones in the reference area of the Library will work (although you are expected to give up your space to someone wanting to use the Library databases).

Both Netscape and Internet Explorer at Cal Poly almost always open with the Cal Poly home page, http://www.csupomona.edu/; if not, you'll have to enter the address in the Location box.

Remember the three cursors: The I-beam selects text, the hand clicks hyperlinks, and the arrow does everything else (which is usually nothing).

The best way to learn Netscape or any other browser is to simply play with the program. If you get "lost", or can't get back, just hit the button that looks like a house, or select Go | Home from the menu, to get back to the home page. To start the assignment, you should be at the campus home page. If you aren't there now, type in the URL, or quit the program and start it again. If you don't have this assignment on paper, you can open a new browser window (Netscape: File | New | Navigator window; Internet Explorer: File | New | Window) so you can switch back and forth between this assignment and the pages you are viewing.

  1. Find the Colleges, Schools, and Departments link and follow it.
  2. Find and follow the College of Science link.
  3. Find and follow the Biological Sciences Department link.
  4. Find and follow the Links link.
  5. Find and follow the CSUBioWEB link.
  6. Find and follow the Purpose of CSUBIOWEB link.
  7. If you've done everything correctly, you are now reading the Purpose of CSUBIOWEB document, which is located at Cal State Stanislaus (did you even know there was a Cal State Stanislaus?). Mail this page to jcclark@csupomona.edu, using the File | Send page option in Netscape. The subject must be bio256a4-1 (lower case, no spaces). (Have you noticed how all the subjects contain "bio256"?) Remember to include your name, SSN, and email address in the body of the message unless you are using a web browser on your own computer and you are sure that it is using your email address as a return address.
  8. Now, all on your own, find the home page for Dr. Curtis Clark of the Biological Sciences Department. Find the answers to the three questions below, and put them in the email for this assignment (see below). No cheating or helping each other!!!
    1. In what year did Dr. Clark get his Ph.D.?
    2. What is the title of the BIO 680 course that he teaches? (Hint: it's not "Open Topics")
    3. What is the name of any one of the Statistics Cops? (Hint: look under "Essays")
  9. You're done with this part of the assignment. Don't go away; there's more below.

The most popular and complete subject directory is probably a site called Yahoo! (the exclamation mark is part of the name). You can find it at http://www.yahoo.com. Go there now so that you can do the next part of your assignment. When you get to the site, you'll see links to about 14 different categories, such as Arts and Humanities, Education, Health, and Science. Since we are scientists, follow the Science link. Now you'll see a menu of various scientific links. We are biologists, so follow the Biology link. Another menu appears - this time it's biological topics. Let's follow the Molecular Biology link. Yes, more links appear! We're interested in Journals, so follow that link - and guess what? More links! Browse down through the links, and you should find a link to a journal called Genes & Development. Write down the URL for this site (the journal itself, not the Yahoo! page that links to it), and add it to the email for this assignment (see below). List the URL in your email message on a line all by itself, like this:

http://www.csupomona.edu

Your instructor will follow this URL, and to get your credit, the instructor has to be at the home page for Genes & Development.

There are many search engines available on the Web, and they all claim to offer some advantage or another. When you really want to be sure you have found every bit of information on a topic, it is worth checking several of them. But for this assignment, we are going to concentrate on a specific one, Google. This search engine is only about three years old, but it has already established a reputation for consistently giving useful results, to the extent that there is now a verb "to google", meaning to look something up on Google. It is not as elaborate as some other keyword searches: by default it finds only pages that contain every one of your search terms, and it won't find pages that contain just any one of them unless you use the advanced search. It does let you search for phrases. Even with these limitations, though, it is often the best place to start.

Let's say you wanted to find out if there were any clinical trials going on in California that involved gene therapy for cystic fibrosis. If you enter the single word gene in the Google search box, Google will retrieve the locations of nearly six million pages that include the word. Considering how common the word is, this is not surprising. If you add the word therapy, Google will retrieve around 600,000 documents that include both words, and on the first screen of results, there will be several informative sites about gene therapy. If you put the words in quotes - "gene therapy" - Google will only show the 306,000 pages that contain the phrase (fewer, actually, than a year ago). Looking for phrases rather than separate words is a powerful search technique.

Adding the phrase "cystic fibrosis", so that the query reads "gene therapy" "cystic fibrosis", reduces the hits to just under 19,000 (the four words entered separately, not in phrases, give nearly 28,000 results). Add california (you don't need to capitalize it), and there are fewer than 3,000 hits. That is still a lot to go through, but the way Google is designed, the first pages it displays are the ones that the greatest number of other sites link to, so they are more likely to be useful.
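The ordering idea in the paragraph above can be sketched in a few lines of Python. This is a deliberate simplification for illustration, not Google's actual ranking method: it just counts how many other pages link to each hit. The page names and the link table are invented, and the function names inbound_counts and rank are hypothetical.

```python
from collections import Counter

# Invented link table: each page maps to the pages it links to.
LINKS = {
    "siteA": ["trials", "overview"],
    "siteB": ["trials"],
    "siteC": ["trials", "overview", "blog"],
}


def inbound_counts(links):
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:  # a page linking to itself doesn't count
                counts[target] += 1
    return counts


def rank(hits, links):
    """Order search hits from most to fewest inbound links (ties by name)."""
    counts = inbound_counts(links)
    return sorted(hits, key=lambda page: (-counts[page], page))


print(rank(["blog", "overview", "trials"], LINKS))
# ['trials', 'overview', 'blog']
```

Here "trials" comes first because three other pages link to it, while "blog" has only one inbound link.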

Perhaps you are seeking a summer internship at Stanford University. Add stanford to the search terms (they are now "gene therapy" "cystic fibrosis" california stanford) and there are 637 hits. Some of those also involve the University of California. If you aren't interested in those, you can exclude web pages that have the phrase "university of california" by putting a minus sign in front of it. The search "gene therapy" "cystic fibrosis" california stanford -"university of california" gives 219 results. You can see that refining your search is the secret to getting the information you need out of a mass of information that isn't useful.
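The three refinements used above (requiring every term, quoting a phrase, and excluding with a minus sign) can be imitated with a small Python sketch. It is only a rough model of the behavior described in this handout, not Google's implementation. The three sample page texts are invented, and the function names are hypothetical.

```python
import re

# Invented page texts, used only to exercise the matcher.
PAGES = {
    "page1": "Gene therapy for cystic fibrosis at Stanford, California",
    "page2": ("University of California gene therapy and cystic fibrosis "
              "research with Stanford in California"),
    "page3": "Therapy dogs and the gene pool in California",
}

# One query term: an optional minus sign, then a quoted phrase or a bare word.
TERM = re.compile(r'(-?)(?:"([^"]+)"|(\S+))')


def parse_query(query):
    """Split a query into (required, excluded) lists of terms."""
    required, excluded = [], []
    for minus, phrase, word in TERM.findall(query):
        (excluded if minus else required).append(phrase or word)
    return required, excluded


def normalize(text):
    """Lower-case the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def contains(page_text, term):
    """True if the term's words appear consecutively, as whole words."""
    return f" {normalize(term)} " in f" {normalize(page_text)} "


def search(query, pages=PAGES):
    """Return names of pages with every required term and no excluded term."""
    required, excluded = parse_query(query)
    return sorted(
        name for name, text in pages.items()
        if all(contains(text, t) for t in required)
        and not any(contains(text, t) for t in excluded)
    )


print(search('gene therapy'))    # ['page1', 'page2', 'page3']
print(search('"gene therapy"'))  # ['page1', 'page2']
print(search('"gene therapy" "cystic fibrosis" california stanford '
             '-"university of california"'))  # ['page1']
```

As in the real searches above, each refinement shrinks the list: the phrase drops the page that merely contains both words, and the minus sign drops the page that mentions the University of California.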

Now for the next part of the assignment:

When Dr. Clark was a graduate student, he had an office in the Herbarium of the University of California, Davis. Most of the people who worked there then have retired (including his major professor, Dr. Donald W. Kyhos, and the emeritus director, John Tucker), and there is a new set of staff. Fortunately, they are all listed on a web page, along with associated and retired faculty. Develop a Google search query that will return a single match - the staff page from the herbarium. (Be sure to try it out to make sure it works!) Add the query to the email for this assignment (see below). The query should be on a line all by itself; like this:

gene therapy cystic fibrosis

Here are the questions from the assignment above that you should answer in your email. Each answer must be on a separate line. The subject must be bio256a4-2.

  1. In what year did Dr. Clark get his Ph.D.?
  2. What is the title of the BIO 680 course that he teaches?
  3. What is the name of any one of the Statistics Cops?
  4. What is the URL for Genes & Development?
  5. What is the exact Google query?

Summary of assignment

| Check off | Format | Content | Subject line |
| --- | --- | --- | --- |
|  | Forwarded web page | Purpose of CSUBIOWEB | bio256a4-1 |
|  | Email message | Answers to questions above | bio256a4-2 |