Hughes


1 Introduction

This paper describes the development of a large multi-vendor e-commerce web site in detail.

There are many e-commerce web sites in the internet, generally one for each large company. These sites sell items from that company, and usually don't interact with each other very much. Furthermore, if you want to buy a specific type of item, you generally need to find a site that specializes in that type of product.

Currently, there are few places you can go on the internet to browse items from various vendors, there are even fewer sites where these vendors are smaller businesses, such as what you would find at a local craft fair. This is because the cost of entry to selling items on the web is rather high, both in monetary costs, and technical know how. We hope to lower these barriers, and give the smaller business a chance to sell on the web on a professional looking web site. Along with this comes the ability for a consumer to browse items from different origins in one central web site.

The general idea of the site is quite simple. A customer will order an item, that order will get sent off to the vendor, and then the vendor will fill the order. The system keeps track of the order by maintaining the state that the order is in.

The first state is when a customer browses the items for sale, chooses one (or more) and places an order. The order now goes into an "Ordered" state, and the customer's credit card is verified and charged. The order will remain in this state until the vendor selling the item reviews it and either accepts or rejects it.

If the order is rejected, the customer's credit card is refunded and the customer is informed. If the order is accepted by the vendor (Which should happen most of the time) the order goes into an "Accepted" state. It will remain in this accepted state until the vendor ships it, and tells us he shipped it, possibly with a tracking number.

The order will then go into a "Shipped" state. The customer will be notified to tell the customer it has been shipped. The item will remain in a "Shipped" state for two weeks. During this time the customer may come back to the site and say they got the item and it is acceptable, which means the order would go into a "Complete" state and the vendor will be paid. The customer may also come back and say that they have not gotten the item yet. In this case, the item will remain in a shipped state for another two weeks. The customer can also reply with a "Not Acceptable" in which case the administrators of the system would have to become involved to clarify any problems between the vendor and the customer. If the item reaches the end of its shipped period without feedback from the customer, we assume the customer received the item, mark it complete, and release funds to the vendor. This sales cycle can be seen in Figure 1 of appendix A.

This paper is organized as follows. The website is described in detail through section two. Section three talks about personalization and how it can be used. The fourth section describes what we did in terms of personalization. The fifth section explains much of the specific design for this project. The sixth section runs through the various implementation issues that were encountered, and the seventh section discusses some real world issues in getting the site up and running.

This project combines traditional e-commerce techniques with personalization techniques, and applies those to a multi-vendor website.


2. The Website


The development of the web site was an iterative cycle with several stages of development. After every stage of development, we had people not involved with the project navigate the pages as we observed their actions. This helped us to find bugs, and places where the user interface was not clear.


2.1 Customers


The most important users of the web site are the customers, and therefore we will describe the interface provided to them first. Upon visiting the site, the customer is presented with a home page that lists what the administrators of the site would like new users to see. If the user is a returning user, they are presented with a personalized home page, more about that can be found in the "Personalization" section below.

On every page there is a search function so the customer can search through the catalog. The index consists of a keyword-ID and an inverted index table. The keyword-ID table lists all of the keywords that could be found in any of the items' names or descriptions. It maps the keyword to an ID that can be referenced in the inverted index. The inverted index maps each keyword to one or more items. Since the inverted index is organized by the word and not the document, it is much more efficient to find all of the documents that contain a particular word. Words that occur commonly, but don't convey meaning (called stop words), such as "and", "a", "the", are excluded from the index. The index contains an Inverse Document Frequency (IDF). Where the IDF of keyword i is equal to:



N is equal to the number of items in the database, and DF is the number of items that contain the keyword.


To rank the pages we use a TfxIDF search and ranking algorithm. The algorithm gives us scores that represent how close a query is to a document. It does this in vector space. The equation is as follows:



TFij is the term frequency, or the number of times the term appeared, for a keyword in a specific document, and Tfi-max is the maximum term frequency of any keyword in the document. Q represents the set of words in the user's query. The IDF is the previously calculated value. This method gives higher weights to to keywords that are in shorter item descriptions [17]. This approach to searching is similar to the one described in "A World Wide Web Resource Discovery System" [4].

We use a shopping cart system to process sales, which has become the defacto standard for the internet. Next to the description of each item is a quick link to add the item to the shopping cart. The user can then view the shopping cart, remove items from it, and modify the quantities of each item.

When the user decides to check out and finish their shopping experience, they are prompted for their credit card. If the site has seen this person before (identified by their credit card number) they are prompted for a password. If the customer either gets the password incorrect, or the site hasn't seen the customer before, it prompts the user for their name, address, and other details required to make a positive identification. This can be seen in appendix B, Figure 17.

When browsing the site, the customer can click on "Add to favorites" buttons that add the item to their favorites list. Then, at any time they wish, the customer can view a page that lists all of the items on their favorites list. This technique is not new to computers, and has been used in web browsers for years in the form of bookmarks. Lately many web sites have been integrating this feature in an attempt to make the customer feel like they're more in control. The favorites section gives the user a way to easily remember items that they may want to buy later, and gives us data on which items the customer likes which may be used in several personalization techniques described below.

After a customer has placed an order, they can come back at a later time and check on the status of that order. They can also leave comments for the vendor, request that the order be canceled, and read the comments that the vendor has written about the order.


2.2 Vendors


The vendors are our next most important class of user, for they provide the items to sell. They will need to be able to accomplish a series of functions.

The first thing a vendor needs to do to sell their wares on our site is to register and be accepted. Once the vendor is accepted, they will get a username and password. With this username and password, they will be able to access their account management web page. This is the central point of control for that vendor, and lists all of the available actions that the vendor can perform. The vendor will need to provide their own computer and internet access.

Once a vendor has an account, they will need to identify the goods and services that they have for sale. From their account management web page they will have the option to add new items, manage existing items, and remove existing items. To add an item, the vendor will need to fill in a short form that includes the item's name, price, description, and several other important facts. They will also be able to upload one small and one large image of each item. The small image will have a set size of 150 x 100 pixels (This value is configurable by the site administrators), this is the image that will be used inline with the item in searches and in several other places. The larger image may be of any reasonable size and should provide a more detailed view of the item. A unique feature will be the ability to choose a template. This template will determine how the information about the item will be displayed on the web page, and can be selected from a gallery of pre-made formats. When a vendor has the items he wishes to sell entered in, he merely has to sit back and wait for the orders to come in.

One of the most important functions the vendors will have to do is maintain their inventory. They will have to keep accurate counts of the number of available items they have for sale. To do this, there is a simple form listing all of the items, and the inventories for each. This makes it easy to quickly update several different items. Whenever an item is sold, the inventory value for that item is updated automatically.

Occasionally it may be necessary for the vendor to change the personal information about their business. From the vendor account management page they will have the option to change this information. All changes will be shown on the administrator event log, so the administrators can take appropriate action against vendors entering in bogus information.

The vendor needs to be able to see his orders. This is done through an interface on their vendor home page. Some vendors may not expect to do a large amount of business. These vendors may wish to have an email reminder sent to them once a day if they have any new orders so they don't have to check their new orders every day.


2.3 Administrators


The administrators of the system, the people who run the web site, will also need several features to ensure the smooth operation of the site.

They need an event log. This log lists whatever changes may be made to the status of any order, any system problems, whether daily maintenance functions worked correctly (Like mailing vendors about new orders), and other system-specific items. The view for this log can be filtered to only view certain types of events, so different administrators can concentrate on different aspects of the site.

The administrators also need the ability to change the status of any order. From the administrative home page, the administrators can interactively change the status of any order. Care must be taken if financial transaction had already occurred in these orders, for the system will allow an administrator to double pay a vendor, or refund a customer's credit card more than once.

When vendors sign up for the service, they will be placed into a queue until the administrators verify the information. The administrators can then go into the vendor list, select a vendor, and edit any of the details for that vendor, including whether or not the vendor is active. This way they can change anything if it's offensive or incorrect. They likewise have the ability to change any of the information for the items being sold. This information includes any comments entered by customers. They also have a way to temporarily suspend a vendor or an individual item. The administrative interface for editing vendors can be seen in appendix B, Figures 22 and 23.

2.4 User Tracking

HTTP is a stateless protocol, but we need a way of tracking the user. The system we use will need to be flexible enough to be used for a wide variety of tasks, but require little intervention from the user.




3 Design


3.1 General Design Issues

One of the most important decisions in any project is the selection of a development environment, and more specifically a programming language. During this selection, we looked at several technologies. The first we looked at was writing the system in C or C++. This approach would probably have given us the best performance, but at a cost of longer development time. C just isn't suited well for server side web development. We then looked at Active Server Pages (ASP) from Microsoft. The major downside to this approach would have been our reliance on the Windows NT server platform, or a proprietary ASP engine that could run on other platforms. Java Server Pages (JSP) were a very attractive offer, but are still not widely accepted, and would have presented a problem later on when we tried to find a web hosting company to run the web site on. This left us with only two more obvious solutions, PERL and PHP. We decided to go with PERL since it is more universally available. PERL (the Practical Extraction and Report Language) has been proven as a good language to implement web systems in. We choose to run these perl scripts on an Apache web server, since it's the most widely used web server to date [2].

There are a great number of databases that we could have chosen. The "real" RDBMS such as Oracle, SyBase, and Interbase were all out of our price ranges. We had to settle on MySql (3) which is a free implementation of nearly the entire ANSI SQL standard. Compared to the other database engines, it is quite fast, but this speed comes at a price. MySql does not support transactions or data triggers, meaning extra care had to be taken on the site to preserve data integrity. It is due to this that we use a transaction log in addition to storing the values for the most important tables. Even still, it is impossible to assure no data will be out of sync if the database server crashes or looses it's network connection. The credit card processing company also maintains a database of charges on their side that we can reconcile with so, in the worst case scenario, we can always refund the customer's money.

This project followed an incremental and interactive development model. Craig Larman describes it as, "An iterative life-cycle based on successive enlargement and refinement of a system through multiple development cycles of analysis, design, implementation, and testing" [1]. In the first phase we developed the vendor interface, and the underlying structure for it. The second phase consisted of the customer searching and shopping pieces, and the third stage concentrated on the personalization of the web site.

The software itself is written in many small specialized CGI scripts. The alternative was to write one, or a few, large general purpose scripts. We choose this format because it is quicker to write the functionality of each page and it is easier to understand what each of these scripts does. The major downside to this approach is that it is often difficult to find the correct script that performs an action. We tried to use a naming scheme that would alleviate this.


3.2 Database Design


The web system was designed from the perspective of the data, and everything else was written to work around that data. Due to this fact, it is easiest to describe the design of the system in terms of that data, keeping in mind that there are web pages to edit, view, and modify nearly all of these tables, and those pages can be seen in Appendix B. You can see this design in Appendix A, Figure 3. The interactions between the tables are shown in the descriptions of the individual tables below.


3.3 Table Design


Each table in the system will now be described. First we will show the structure of the table, and then describe anything out of the ordinary about that table, and the important links to other tables.


3.3.1 Accounting

#

Field

Type

0

ID

int(11)

1

acct_type

enum('customer_charge','customer_credit',
'vendor_charge','vendor_credit','misc')

2

amount

decimal(8,2)

3

ordernum

int(11)

4

orderitem

int(11)

5

acct_num

int(11)

6

noteID

int(11)

7

when

timestamp(14)

8

invoice

int(11)

(Database table accounting)


Each record in the accounting table represents one type of financial transaction in the system. Most of these records will be from either money coming in from credit card sales, or payments going out to vendors. The other types of transactions include the paying of bills, or the income from outside investors. The acct_type field identifies the type of accounting record that this it. The amount is the dollar amount (USD) that the transaction is. Money coming into the site is always positive, and money being paid out from the site has negative values, this allows a simple summation of the records to get a quick snapshot of the financial status of the site. Accounting records are the revenue or expenditures that have already happened or are expected to happen, when the money is actually transferred, the accounting records become invoiced, and records in that table are created. You can see the vendor's view of these records in appendix B, Figure 19.

There is also an additional interface for the administrators, so they can create new accounting records, remove records that haven't been invoiced yet, or modify records that haven't been invoiced. Along with each accounting record, there can be notes that describe various aspects of each accounting record. These are stored in the account notes table below.

Since most accounting records are based on an order, we strayed from a completely normalized table structure and added an ordernum and orderitem fields that can be used to reference the orderID and orderItem fields from the orderItem table below. If the item has been invoiced, the ID field from the invoice table will be referenced in the invoice field, otherwise this field is set to null.


3.3.2 accountNotes

#

Field

Type

0

ID

int(11)

1

accountID

int(11)

2

comment

text

(Database Table accountNotes)


The account notes table lists notes that get written for the various accounting records. Many of these notes are automatically generated by the system and include useful data about each transaction. Other notes are manually entered by the administrators to describe extraordinary events.

The accountNotes accountID field references the ID field in the accounting table. There can be many accountNotes for each accounting record.

3.3.3 adAccount

#

Field

Type

0

ID

int(11)

1

accountName

varchar(20)

(Database table adAccount)


The advertisement system is still in its infancy, but provides a way to dynamically display different ads, and to keep track of how many of those ads were viewed, and when they were viewed. The adAccount table lists each "account" set up to display ads. This allows one source to have more than one ad that can be displayed on the site. The accountName is a friendly name so that the administrators can differentiate between the advertisers.


3.3.4 adLog

#

Field

Type

0

ID

int(11)

1

ad

int(11)

2

type

enum('view','click')

3

when

timestamp(14)

4

ip

varchar(15)

(Database table adLog)


The adLog is used to keep track of what ads were shown or clicked, when it happened, and to whom they were shown. The type field lets us record both views and clicks in the same table. The when field records when the ad was shown, and the ip field records the ip address of the user who viewed or clicked on the item. This table will allow easy billing to advertisers based on the activity of their ads.

The "ad" field references the ID field in an advertisement record.


3.3.5 advertisement

#

Field

Type

0

ID

int(11)

1

name

varchar(30)

2

adType

enum('h_banner','v_banner','text_small','text_large')

3

weight

int(11)

4

html

text

5

account

int(11)

(Database table advertisement)


Each record in the advertisement table lists one possible advertisement that can be shown on the web site. The name field represents a friendly name of the ad that can be displayed to the administrators, or printed on ad reports. The adType describes what type of advertisement the record represents. An h_banner is a horizontal banner that can be displayed across the top or bottom of the screen, v_banner is a vertical banner that is suitable for display on either side of the screen, text_small is a small text-based advertisement (typically under 20 words), and text_large is a large advertisement, perhaps several paragraphs long. The weight parameter determines how often the advertisement is displayed, where the probability of the ad being displayed is equal to the weight over the sum of all the weights of all the advertisements.

The account field links to the adAccount ID field, and must be a valid entry in the adAccount table.


3.3.6 allowedShipping

#

Field

Type

0

itemId

int(11)

1

shipID

int(11)

(Database table allowedShipping)


Each item has a list of ways it may be shipped to the customer. This list is set up when the item is first entered into the system, and can be edited through the vendor interface (Appendix B Figure 22).

This table associates the ID field in the items table with the ID field in the shipping table, with one record for each shipping method allowed for the item. There should not be any duplicate records in this table.

3.3.7 cart

#

Field

Type

0

ID

int(8)

1

customerId

int(8)

2

when

timestamp(14)

3

ip

varchar(15)

(Database Table cart)

The cart table represents a single shopping cart for a user (Appendix B, Figure 6). The "ip" field is the ip address the user is from, and the when field represents when the shopping cart or the contents of the cart was last edited. Any cart that hasn't been edited in a certain time period should be considered inactive and can be deleted. The length of the persistence of the cart is a business decision and is configurable depending on what the administrators decide.

The customerID is the standard user ID from the user tracking system, and can be related to the ID field in the users table.


3.3.8 cartItems

#

Field

Type

0

ID

int(8)

1

cartID

int(8)

2

itemID

int(8)

3

qty

int(5)

(Database Table cartItems)

The cartItems table lists the contents of a shopping cart. When a record in the cart table is deleted, all of the cartItems that reference that record should also be deleted. The qty field lists how many of the item is in the user's shopping cart.

The cartID field corresponds to the ID field in the cart table, and there should never be a cart item without a corresponding cart record. The itemID field lists the ID of the item in the cart from the items table.


3.3.9 categories

#

Field

Type

0

ID

int(8)

1

description

varchar(30)

2

score

tinyint(4)

(Database Table categories)


The categories that items may be in are stored here. Some examples are "Dolls", "Arts and Crafts", and "Clothing". The description field is the displayable name of the category. The score field is able to be set by the administrators, and is a measure of how visible they wish the category to be to people that have not indicated a preference to individual categories. The five highest scored items will be displayed on those user's quick category listing.


3.3.10 catItem

#

Field

Type

0

ID

int(8)

1

catID

int(8)

2

itemID

int(8)

(Database Table catItem)

An item can be in on or more categories, and the catItem table records this. The catItem table is much like the allowedShipping table, except it lists the categories the item is in instead of the methods of shipping allowed for the item. The vendor can edit the categories the items occur in through the vendor interface (Appendix B, Figure 21).

The catID field represents the ID in the categories table, and the itemID represents the item's ID from the items table.


3.3.11 ccauth

#

Field

Type

0

ID

int(11)

1

response

int(11)

2

respMsg

varchar(25)

3

authcode

varchar(20)

4

amount

float(10,2)

5

ordernum

int(11)

(Database table ccauth)


Every time a request to authorize a credit card is sent out to the credit card processing company, a response is sent back to our system. The table ccauth records these responses, which come back from the verification service. There is a limited set of valid responses, and anything but an accepted message (value = 0) should be considered a failure. The respMsg is a textual reason why the transaction was or was not approved. If the transaction was approved, we will be supplied with an authorization code. In the event of a discrepancy between the credit card processing system and our system, we must be able to provide this code. The amount field is the total amount of the transaction and should be equal to the total of the order, if it is not the administrators should be notified.

There is a single script that gets called whenever a response comes back from the verification company. This script fills this table in, updates the status of the order from preliminary to ordered, and creates accounting records to represent the order. If the vendor is set up to receive emails when new orders are placed, it creates a new email request and puts it into the email queue.

The ordernum references one of our internal orders via the ID field in the orders table. Since the credit card is authorized and processed for each order, and not per order item, there should only be one of these for each order.


3.3.12 counter

#

Field

Type

0

ID

int(8)

1

pagename

varchar(30)

2

count

int(8)

(Database table counter)

On many pages there is a message at the bottom that displays how many times that particular page has been viewed. Those counts are stored in the counter table, and are referenced by the url of the page being viewed. The script that displays the message is capable of looking at the environment variables supplied by the apache web server and automatically determine what page is being viewed, increments the count for that page, and then outputs that count.

3.3.13 creditCard

#

Field

Type

0

ID

int(11)

1

cardNumber

varchar(16)

2

customerID

int(11)

3

address1

varchar(30)

4

address2

varchar(30)

5

address3

varchar(30)

6

city

varchar(15)

7

state

varchar(5)

8

zip

varchar(10)

9

expireMonth

tinyint(4)

10

expireYear

smallint(4)

11

firstName

varchar(20)

12

lastName

varchar(20)

(Database Table creditCard)


The customer credit card and billing information is stored in this table. Records in this table can be periodically deleted for users who haven't been active for a certain time period to reduce the damage done if a break in ever occurs. The information in this table is entered during the checkout process (Appendix B, Figure 3).

The customerID field references the ID field in the customer table.


3.3.14 customer

#

Field

Type

0

ID

int(8)

1

lastName

varchar(30)

2

firstName

varchar(30)

3

email

varchar(30)

4

address1

varchar(30)

5

address2

varchar(30)

6

address3

varchar(30)

7

city

varchar(30)

8

state

varchar(30)

9

zip

varchar(10)

10

country

varchar(15)

11

spent

decimal(8,2)

12

allowToBuy

boolean

13

ccInfo

boolean

14

password

varchar(25)

(Database Table customer)

Information about the user is stored in the customer table. The address, state, city, and zip fields all correspond to the user's shipping address. This information is entered during the check out process (Appendix B, Figure 4). The password is used later on when the user wishes to access their account to edit account details, or view order status. This password is stored in an encrypted form. The allowToBuy field allows administrators to mark accounts that cannot purchase items, perhaps due to fraud attempts. If the ccInfo field is set to true, this means that there is at least one record in the creditCard table that represents the user's credit card and billing information. The spend field stores the total amount the user has spent, and can be used as a way to quickly determine your most valuable customers.


3.3.15 email

#

Field

Type

0

ID

int(11)

1

userID

int(11)

2

userType

enum('Customer','Vendor','Admin')

3

emailData1

int(11)

4

emailData2

int(11)

5

emailData3

int(11)

6

emailType

varchar(6)

7

emailSent

boolean

8

emailDate

date

(Database Table email)

Many different types of emails are sent by the system. These include such things as order notifications for vendors, shipment notifications for customers, and abnormal errors to the administrators. It would be rather inefficient to actually create the email and send it from whatever CGI script wishes to send it off, so an email queue has been created. When a script wishes to send an email, it fills in the appropriate fields in the email table and forgets about it. On a schedule a program runs and scans through the email table looking for new records. When it encounters one it builds the email and attempts to send it out.

The userID field can have several different meanings depending on the userType field. If the userType is set to 'Customer', it corresponds to a customerID, if it is set to 'Vendor' then it corresponds to a vendorID, and if it is set to 'Admin' it corresponds to an administrative user ID. The email program can then look up the email address of the user from either the customer, vendor, or admin tables. The emailType field lets the email queue know what type of email is requested. Some examples of this are 'ORDERS' which is an email to let the user know the order has been shipped, and 'ORDERE' which lets the user (usually the vendor) know an order has been entered. The three email data fields represent parameters for the type of email to be sent. These parameters are different for each email. For the two examples given above they would include the order number and possibly the order item. The emailSent field lets us know if an email has been successfully sent, and the emailDate tells us when that email was sent.


3.3.16 event

#

Field

Type

0

ID

int(11)

1

messageType

enum('Warning','Error','vendOrder','custOrder')

2

reference

int(11)

3

message

text

4

emailSent

boolean

5

emailDate

datetime

6

messageDate

datetime

(Database Table event)

During the course of operations, many events will be generated. These events can be warnings, errors, or orders abnormally changing status. Administrators can look at these warnings based on the messageType which corresponds to these different types of events. Errors are generally fatal errors in the scripts execution, things such as invalid data in a table found or running out of memory. Warnings are events that happened, and from which the script was able to recover. An example of this is if the customer enters in a name too long for the database, and the script had to truncate it.

If the event is about an order, the order ID is in the reference field. If it was a fatal error, the exit code is in the reference field.



3.3.17 history

#

Field

Type

0

ID

int(8)

1

orderID

int(8)

2

orderItem

int(8)

3

Action

varchar(30)

4

when

timestamp(14)

(Database Table history)

The history of every order is recorded to reconstruct the chain of events that an order went through in the case of a problem. The history table stores this information. 'Action' is a textual description of the action, and the when field is the date and time the action happened. The only type of event that is not recorded in this table is when an order's status is changed.

The order ID and orderItem both correspond to entries in the orderItem table, and can be related to that table. There can be many events for any single order item.


3.3.18 images

#

Field

Type

0

ID

int(11)

1

path

varchar(30)

2

vendor

int(11)

3

lastMod

timestamp(14)

4

name

varchar(15)

(Database Table images)

Each item has two images associated with it, a full size image and a thumbnail image. More than one item can point to the same image. The images table lists all of the images that the vendor has uploaded to the system's web server. The path field is a relative path from the root web directory to where the image can be found. The 'vendor' field lists the vendor ID of the vendor who uploaded the image. The name is a simple name by which the vendor to remember the image. The lastMod field tells when the image was last modified (either uploaded, or renamed).

When a vendor wishes to upload a new image, he seletcs the name and local file location in an HTML form. When one hits the submit button, the file is sent, and the file is saved in a file with a specially formatted name. Currently these images are configured to go into the /item_images directory of the web server, but this is configurable. After that, the vendor can select that image as either the thumbnail or full-sized image for any item. If they select it as the thumbnail, it is scaled to 150x100 when displayed.


3.3.19 invoice

#

Field

Type

0

ID

int(11)

1

accountingID

int(11)

2

datePaid

date

3

note

text

4

invbatch

int(11)

(Database Table invoice)

Any time money comes in or goes out of the system, an invoice record is created. The datePaid field lists when the money was actually transferred. The note field lists any special notes about the payment which may include check or account numbers. Certain types of accounting records will be grouped together in batches. An example of this is when a check is cut for all of the transactions a vendor did in the previous month. Only one check is cut, but many invoice records may be needed all with the same invBatch number. These invoice records are either created through an interface to the administrators, or automatically by an invoicing program that runs once a month.

The accountingID is a reference to the accounting table's ID field which contains information about the amount of the transaction, when it was recorded, and other useful information.


3.3.20 itemComment

#

Field

Type

0

ID

int(11)

1

itemID

int(11)

2

comment

text

3

commentAuthor

varchar(60)

4

commentDate

datetime

(Database Table itemComment)

From then on customers or administrators can attach comments to items. Whenever a customer looks at the detail of that item, the comments for that item are displayed. The itemComment table holds these comments. The administrators have a method to edit or delete comments.

The itemID references the ID field in the items table, and there can be several comments for each item.


3.3.21 itemPref

#

Field

Type

0

customer

int(11)

1

item

int(11)

2

lookTime

int(11)

3

lookNum

int(11)

4

buy

int(11)

5

views

int(11)

6

actionLike

int(11)

7

collabLike

int(11)

8

lastlook

datetime

(Database table itempref)

The itemPref table represents how much a particular customer likes an item. The lookTime field corresponds to how many seconds the user has looked at the item in total, the lookNum represents how many times the user has looked at the details of the item. The buy field shows how many of the item the customer has bought. The actionLike value is calculated from these fields, and represents the implicit vote that the user gives to the item. The collabLike value is the collaborative filtering algorithm's prediction on how much the user will like the item. This field should only be referenced if there is no actionLike value. The lastLook field is used to calculate the lookTime for the user.

Since the web server only gets requests for pages, and does not record when the user stops looking at the page, we assume the time that the user looks at a page to be from when the first views the page until they look at a different page. Every page of the system has a special script embedded in it that updates any items that the user was looking at.

The customer field represents a record in the users table referenced by the ID field. The item field links to the ID field in the items table. Each customer / item combination in the itemPref table should be unique.



3.3.22 items

#

Field

Type

0

ID

bigint(8) unsigned zerofill

1

shortName

varchar(30)

2

longName

varchar(60)

3

description

text

4

vendor

int(8)

5

price

decimal(8,2) unsigned zerofill

6

thumbnailUrl

varchar(30)

7

imageUrl

varchar(30)

8

infoUrl

varchar(30)

9

inventory

int(6)

10

lastModified

timestamp(6)

11

expires

datetime

12

active

enum('','True','False')

13

weight

double(4,2)

14

width

double(4,2)

15

height

double(4,2)

16

depth

double(4,2)

17

taxable

boolean

18

shippingCharge

boolean

(Database Table items)

Each item that is for sale is represented by a record in the items table. Information about the name, the weight, and who sells the item is all recorded in this table. The inventory field lists how many of the item the vendor has for sale. If the item is to have a limited life span, the expires field can be set, in which case the system will not allow any more of that item to be sold after that date. Likewise, the item will not be able to be sold if the active field is set to false. The taxable and shippingCharge fields represent whether or not the item is taxable or if a shipping charge should be added.

Any item that has had at least one item sold can not be deleted, for it would break links to several other tables. The preferred method in these cases is to simply mark the item inactive. If none of the item has been sold, any record in the following tables that references the item should also be deleted; allowedShipping, catItems, cartItems, searchIndex, and itemPref. A simple way to determine if an item has been sold is to check the orderItem table for any references to the item. If one exists, the item has been part of an order, and can not be deleted.

The vendor field corresponds to the ID field in the vendors table.


3.3.23 orderComment

#

Field

Type

0

ID

int(11)

1

orderNum

int(11)

2

orderItem

int(11)

3

author

varchar(60)

4

comment

text

(Database Table orderComment)

Most orders should flow through the system without problems. The customer orders something, the vendor accepts it, and it is shipped in a reasonable amount of time. However, there will be times when the vendor is late in fulfilling the order, or special instruction on delivery are needed. When these types of things happen, comments can be attached to orders. When the status of the order is viewed, these comments can be seen, including when they were written and by whom.

The interface to create the comments also has the option for the author of the comment to also email the comment to the customer, vendor, or administrators. This gives the author of the comment a reasonable amount of assurance that the other parties see it. A comment from the customer is of no use if the vendor never reads is.
The orderNum field relates to the orders table, and using the orderNum and orderItem fields, the orderItem table can be related to.


3.3.24 orderEvent

#

Field

Type

0

ID

int(11)

1

orderId

int(11)

2

orderItem

int(11)

3

oldStatus

smallint(6)

4

newStatus

smallint(6)

5

when

timestamp(14)

(Database Table orderEvent)

Whenever an order changes state, that change is recorded in the orderEvent table. Both the previous and the new statuses are recorded. The administrators can then use this table to track the life span of any order. Reports can be written that display the average time that the orders for a particular vendor are in a certain state, or if any items somehow skip a state, which would indicate a problem with the system.

The orderId and orderItem fields can be used to relate to the orderItem table. The orders table can also be referenced using the orderID field, but since an event is written for an orderItem, the usefulness of this is limited.

3.3.25 orderItem

#

Field

Type

0

ID

int(8)

1

orderID

int(8)

2

orderItem

int(3)

3

itemId

int(8)

4

status

enum('Ordered','Accepted','Shipped','Recieved','Complete','Not
Accepted','Not Recieved','Not Complete','Preliminary','Bad Credit')

5

salePrice

decimal(8,2)

6

qty

int(3)

7

taxPrice

decimal(6,2)

8

shippingPrice

decimal(6,2)

9

expDate

datetime

(Database table orderItem)

Each order can have many items in it. These items are recorded in the orderItem table. The item price is recorded in each order item due to the fact that the price of an item can change, and it is important to note what the price was at the time of the sale. The same logic goes into the storage of the amount collected for sales tax and shipping.

The status field corresponds to the state that the item is in. This is stored on a per item basis since different parts of the order may be filled by different vendors, and the state of those different items may take a varying amount of time.

Vendors can view their order from their order status page (Appendix B, Figure 14). The customer can also view the status of their orders (Appendix B, Figure 8). At certain stages in the life span of the order, the customer or the vendor may also be able to promote the order to the next stage. For instance, by entering the item in, the customer places the item in an ordered state. When the vendor decides to accept the order, they can place the order in the accepted state.

The orderID field should reference a valid ID in the orders table. Each order should have at least one order item.


3.3.26 orders

#

Field

Type

0

ID

int(8)

1

customerID

int(8)

2

datePlaced

datetime

3

ip

varchar(15)

4

emailWarning

boolean

5

emailAccept

boolean

6

emailShip

boolean

(Database Table orders)

Each order that is placed will have an entry in the order table. This table is the header information for an order, and gives each item in that order something to easily reference.

The ip address is stored as an added assurance that the user who places the order is actually the user. This value can usually help the administrators track the user back to the ISP of the user.

If the user had requested to receive emails when the order was accepted or shipped, the emailAccept and emailShip flags would be set. The emailWarning is an option where the customer can have the system send an email if an order item is about to expire. A program that runs on a schedule runs through these order items and generates email queue entries for each email that should be sent on a particular day.

The customerID corresponds to the ID field in the customer table. Since the customer can delete his personal information from the system once all of his or the orders are complete, a value of -1 in this field will represent this.


3.3.27 searchIndex

#

Field

Type

0

item

int(11)

1

word

int(11)

2

weight

decimal(8,6)

(Database Table searchIndex)

The searchIndex table is used for the free text search of the items. It is filled nightly by a program that indexes the items database. The weight is the correlation of the weight of the word to the item as described in the searching section above.

The word field references the ID field from a record in the searchWords table. The item field references the ID field from the items table.



3.3.28 searchWords

#

Field

Type

0

ID

int(11)

1

word

varchar(15)

2

idf

decimal(8,6)

(Database Table searchWords)

The searchWords table is also used in the free text search of items. This table lists all of the available non stop-words that occur in any of the item names or descriptions, and is also generated by the nightly indexing program. The idf is the inverse document frequency required to calculate the relevance of a term in a search query.


3.3.29 shipping

#

Field

Type

0

ID

int(8)

1

Name

varchar(30)

2

Shipper

varchar(30)

3

minCost

decimal(8,2)

4

maxCost

decimal(8,2)

5

calcCode

varchar(5)

6

markup

int(3)

7

handlingFee

decimal(6,2)

8

vendor

int(11)

(Database Table shipping)


The system is capable of knowing about many different shipping methods. These methods are stored in the shipping table. The name is a common name that will be displayed to the customers, good values include "FedEx overnight" or "US Mail". The shipper should be the company used to ship the product. The fields minCost and maxCost define the minimum and maximum amount that will be charged for shipping using this method. There are several ways to calculate the shipping costs, and calcCode holds a code specifying the method that should be used. The markup field represents a percentage to mark up (or mark down) the calculated shipping cost. The handling fee is an additional dollar amount to add to all shipments using this method.

The actual calculations for shipping are then equal to:

cost = min(max(C(zip1, zip2, weight) * markup + handling, minCost), maxCost)

C(zip1,zip2,weight) is the calculated cost of shipment based upon the calcCode. Each shipping company has a different way of calculating this value, and they usually involve looking the value up in a given table. The two zip codes are the sender's zip code and the receivers' zip code.

Most shipping methods will be universal, in which case the vendor field will be equal to -1. However, if none of the universal methods fits a vendor, they can create a new shipment method, and then the vendor field would be set to the ID of that vendor. Only the vendor that creates a shipment method will have access to use it.


3.3.30 shippingZone

#

Field

Type

0

source

char(3)

1

dest

char(3)

2

zone

tinyint(4)

(Database table shippingZone)

The different shipping companies use shipping zones to calculate the shipping costs. These zones are supplied in a series of comma separated text files. When new tables become available, there is a special program that our system can run to parse those files and update the shippingZone and upsGround tables. The source and dest fields reprsent the first three characters in a zip code.


3.3.31 upsGround

#

Field

Type

0

weight

int(11)

1

zone2

decimal(5,2)

2

zone3

decimal(5,2)

3

zone4

decimal(5,2)

4

zone5

decimal(5,2)

5

zone6

decimal(5,2)

6

zone7

decimal(5,2)

7

zone8

decimal(5,2)

(Database table upsGround)

The upsGround table gives the rates for shipping items via ups ground between different zones. To calculate the cost, simply do a lookup based on the weight rounded up to the nearest pound.


3.3.32 users

#

Field

Type

0

ID

int(8) unsigned

1

customerID

int(8)

2

vendorID

int(8)

3

lastIP

char(15)

4

lastUse

timestamp(14)

5

securityId

char(12)

6

admin

char(10)

(Database Table users)

At the heart of the user tracking subsystem is the users table. This table is used to uniquely identify each user, and then to associate that user with customer, vendor, or administrative accounts. The ID field in this table is often referred to as the user ID in other tables and subsystems. On the user's browser is stored two cookies, a user ID and a randomly generated security ID. Those two cookies must be equal to the ID and securityID fields in this table for the user to be authenticated.

If the user is logged in as a customer, then the customerID field can be related to the customer table through the ID field. If the user is logged in as a vendor, then the vendorID field can be related to the vendors table. In either case if the user is not logged in, the field will be equal to -1.


3.3.33 vendorComments

#

Field

Type

0

ID

int(11)

1

vendorID

int(11)

2

commentAuthor

varchar(255)

3

commentDate

timestamp(14)

4

comment

text

(Database Table vendorComments)

Customers or administrators can write comments on vendors in much the same way as they can write comments on the individual items that are sold. The vendorComments table stores those comments. Some examples of vendor comments can be seen in Appendix B, Figure 10.

The vendorID field corresponds to the ID field in the vendors table.



3.3.34 vendorContacts

#

Field

Type

0

ID

int(11)

1

vendor

int(11)

2

firstName

varchar(25)

3

lastName

varchar(25)

4

position

varchar(25)

5

address1

varchar(20)

6

address2

varchar(20)

7

address3

varchar(20)

8

phone

varchar(20)

9

city

varchar(20)

10

state

varchar(10)

11

country

varchar(20)

12

email

varchar(40)

13

public

boolean

14

zip

varchar(5)

(Database Table vendorContacts)


Vendors can have multiple points of contact. If they wish, they can set up different contacts for product support, shipping questions, or billing questions. The vendorContacts table maintains this list. If the public field is set to false, then the vendor contact is not public, and should be hidden from customers or other vendors. Each vendor is required to have at least one public vendor contact.

The vendor field corresponds to the ID field in the vendors table.



3.3.35 vendors

#

Field

Type

0

ID

int(8)

1

shortName

varchar(30)

2

fullName

varchar(50)

3

phone

varchar(10)

4

address1

varchar(30)

5

address2

varchar(30)

6

address3

varchar(30)

7

city

varchar(30)

8

state

char(2)

9

country

varchar(10)

10

zip

varchar(10)

11

url

varchar(30)

12

reliability

int(3)

13

status

enum('Active','Inactive','Pending','Suspended')

14

description

tinytext

15

username

varchar(20)

16

password

varchar(20)

17

taxRate

float(10,2)

(Database Table vendors)

Each vendor in the system will be represented by a record in the vendors table. This table lists the business information of the vendor. The taxRate field stores the vendor's tax rate for the sales tax of their home state. The reliability score represents how reliable the vendor is, and is shown to the customers in the vendor reliability graphs (Appendix B, Figure 16). Vendors can change their address, url, name, or password fields. All other fields are maintained internally by the system, or supplied by the administrators. The interface for this can be seen in Appendix B, Figures 22 and 23. Only vendors with a status field equal to 'Active' are allowed to log in or sell items.


4. Implementation


Many of the issues described in the design portion of this page are very straightforward, and can be inferred from the descriptions of the database tables. Simple actions that merely view, add, remove, or modify records in tables will not be discussed in detail. Some operations, however, can be efficiently implemented in many different ways, and those implementation issues are presented here.


4.1 Searching


One of these functions is the creation of the search index. Every night, an indexing program is run to index all of the items in the database. This indexer is very similar to the ones used by popular web site search engines, except ours indexes a local database table instead of a remote web page. The indexer also has the task of calculating the inverse document frequency (IDF) of each word as described above. If the searching system didn't have the nightly index, we would need to search through the names and descriptions of every single item for every single query. The downside to our approach is that you can't search for items that are not in the index. This means that items are not immediately searchable when they are entered in or modified. A way around this would be to have the index updated whenever an item is updated, and this would be a valuable extension to this project.

As you may remember, the formula for calculating the relevance of an item to a query was defined by:


An optimization we used was to calculate during the index process and save it. We called this the weight Wij of the term i for the document j. Our equation then becomes:

Riq is then used to sort the items in order of relevance, and at most M items are shown on a single web page where M is a configurable number between 1 and the total number of documents retrieved.

The other end of the searching is the user interface and the search engine itself. The user interface was implemented by an HTML form on the top of every page to allow searching from anywhere. The HTML form triggers a CGI script that calculates the relevance of the various items, and then presents them to the user. This can be seen in Appendix B, Figure 15.


4.2 New Vendors


The queue for new vendors was implemented by using the standard vendor database table, and setting the status to "preliminary". To be placed in the queue, the vendor would fill out the new vendor HTML form (Appendix B, Figures 25,26), and submit that to a CGI script that stores the data in the database. From there, that vendor will be investigated to assure they can provide the goods or services they wish to sell. This investigation is a manual process conducted by the administrators. Depending on the volume of business that the vendor wishes to do, it may involve credit checks or business references.


4.3 Vendor Order View


The vendor is given a web page that allows him to view his orders. This page can be filtered in three different ways: one for open orders not yet accepted, one for orders accepted, but not yet complete, and one more for orders that are complete. Each of the views are identical, and consist of a table with the item ordered, the customer's name, the price of the item, the amount paid for shipping on the item, and the amount that the administrators of the web site took from that order. There is also a link for each item that shows detailed information on the item, the customer, and the shipping method requested. The "Open Orders" page has a means for the vendor to accept or reject the order. The "Accepted Orders" page has a way to mark the item as shipped. The "Completed Orders" page also shows when the funds were released to the vendor. This is done through another CGI script that queries the database, and can be seen in Appendix B, Figure 14.


4.4 User Tracking


Since the HTTP protocol is a stateless protocol, we need a method of identifying users. One of the most popular ways of doing this on the internet is through the use of a cookie, a small piece of data that the user's browser holds and sends back to the web server on each request. We use two cookies, one is a user ID, and the other is a security code. The user ID corresponds to an account ID stored in the users database table. The security code is a 12 character random string generated the first time the user visits the site. If the security code does not match the value we have stored for the user, we do not allow the user access to the user account. This way, a user can not simply change their user ID cookie to access other people's accounts. For the security code, we use the 26 capital letters, the probability that an intruder can guess the security code for a specific user is 1 in.

Whenever a new user visits the web site, a new user account is created for him. Initially the account holds little information, just the user's IP address, the security code, and the date / time of the last visit. As the user browses the items, that information is recorded against this account ID. If the user buys something, a new customer account is created, and the customer account is related to the user account. If the user logs in as a customer, the system also associates the customer account with the user account. If the user logs in as a vendor or an administrator, the system associates the vendor or administrator account with the user account. Through this method, the web site slowly learns more and more about a user with little intervention from that user.


4.5 Testing And Security


We have extensively tested the web site throughout the development cycle. After each new feature (or set of features) was written, we verified the following items:


          1. We check all incoming variables

          2. We properly quote anything going into the database

          3. All database operations are consistent with our table design


The first two are done for security reasons, and help to weed out most security concerns. Checking of incoming variables generally meant making sure that any invocation of CGI->param() returned a good value of an acceptable length. This eliminates most buffer overrun type exploits. To quote the values going into the database, we used the DBD->quote method which replaces any special characters with their back-slashed counterparts (Such as turning a newline into \n, or a quote character (") into a slash quote (\") ). By doing this on all values put into the database, we could assure that queries wouldn't fail for missing values, intruders couldn't alter the query using, and it made the code cleaner in certain places where we had quote overload. These two tests are easy to apply to any perl script. The third test is much more complex, and provides a sanity check for the system. As we created each table, we defined a set of rules that governs any data that goes into that table. In a more advanced database server, we could have put this logic right into the database, but since our database server doesn't support this, we needed to do it in our code. As long as no script makes bogus changes to database tables, the system should remain in a consistent state.

In addition to these standard tests, we performed usability testing. Throughout the project, two people not involved with the development helped us with usability testing, and through these people we were able to identify when the sites function were not clear to the user.

After each script was written, it was thoroughly tested. We created several test cases for each script that identified all the functionality of that script, and assured that the script performed correctly for those test cases. We also tried using the script in incorrect methods, making sure that malicious users could not damage the system. Often times several scripts worked together towards one goal, such as checking out. When that happened, we created additional test cases that test these interactions.


5. Personalization


Web personalization is the process by which the content or layout of a website is altered to better match individual users. The personalization can be as simple as displaying someone's name on a web page to complex collaborative filtering where the user is presented with options the system thinks they will like.

When considering using any personalization techniques, it is important to always remember why you are doing it. There are many new tools emerging for personalization, and often times it is tempting to use any new flashy technology that comes along. There is one simple rule that should be considered: personalization should add value to the customer's experience. In this context, the customer could be an actual paying customer, someone looking at your web site, or whoever the target user happens to be. If the personalization hinders the user, it either should not be used, or not used by default. [5]

The personalization system must be predictable. People will be driven away from the system if they think it's just randomly throwing selections at them. For instance, consider a personalization system that displays newspaper articles that it thinks the user would want to see. If the user cannot figure out why it is showing the items that it is, they are less likely to use the system. It should also have a way for the user to view other articles, in case the system is incorrect. [6]

Another reason to personalize is that the personalization collects significant amounts of data. The data is often of particular interest to people doing market research. Buying trends, customer classification, and advertisement targeting can all be derived from this data. [7]


5.1 Collaborative Filtering


One technology available for personalization is collaborative filtering. Collaborative filtering is the process by which data is collected from a large group of people. Then predictions are made based on the similarities of the users. It simulates "word of mouth" recommendations between people who have never met or communicated with each other.

Collaborative filtering generally is used for recommendation systems. The recommendations can be implicit or explicit. Explicit recommendations are those a user actively seeks out by pressing a button or clicking on a link. Implicit recommendation is the process by which the user may not even know there is a collaborative filtering system, and the system puts items that the person is more likely to want in more visible places.

One example of a collaborative filtering system is Ringo [11], which gives personalized music recommendations. Ringo exploits the similarities between the tastes of different users to recommend musical pieces. It first asks the user to rate 120 artists based upon how much the user enjoys listening to the artist (Or if the user has even listened to the artist before). The user can then ask Ringo for albums / artists they will like, artists / albums they will hate, or to make a prediction about a specific item. Ringo has become increasingly popular since it's release in 1994, and now boasts around 2100 users. [10]


5.2 Implicit Voting


The Ringo example above uses explicit voting. The user had to actually say how much they liked each item. The opposite end of the spectrum is implicit voting, where the system makes guesses about how much a user liked each item based on their actions.

Explicit voting requires the user to do work, and this work is not generally rewarded until the user has given their opinion on a large number of items. If the user does not perceive the benefit of voting on items, or if the cost of that benefit is too high, the user will not vote on the items. Implicit voting gets around this by not actually requiring the user to do anything differently than if the system were not in place. Generally, implicit votes carry less meaning per vote than explicitly voting, but there tend to be more of them available, making them just as valuable as explicit voting. [12]

In order for the system to make a guess on the vote, it must first collect data about the user's actions. Listed below are some of the more common actions that can be recorded.

Action

Notes

Item Purchased / Purchase price

The fact that a customer has purchased an item, can be taken as a strong indication that the user liked the item.

Assesses

The user evaluates or recommends the item.

Save / Print

The user either saves the item or prints it out.

Delete

The user deletes the item

Consider

The user looks at the summary of an item.

Repeated Use

The user views the item many times

Mark

The user bookmarks or adds the item to a favorites list.

Refer

The user refers to or cites the item.

Examine / Read

The user looks at the item's full description.

Glimpse

The user sees the title in a search or similar list.

Queries

Association of terms from queries.

Table 1 (Adapted from [12])

Not all of these fields are applicable in every situation. The actual vote for an item is a combination of these fields. This combination is dependent on the particular type of items to be voted on. The same criterion for rating Usenet news articles will not work for selling items from a web based store.


5.3 Adaptive Web Pages


Adaptive web pages are those that change their structure to better suit users. Adaptive web pages can be split into two categories whether or not this structuring is done on a per user basis.

A system that does not customize on a per user basis may be further broken down into two categories. One is where the system automatically updates the structure of the site, and the other is where it simply makes recommendations to the webmaster. Either system will somehow look at the way people visit the website, either through server logs or special CGI scripts, and then group behaviors together. The manual system will then recommend changes to the webmaster. The automatic system will make the modifications to the website itself. These modifications may be from highlighting often used links, adding or removing links, to even adding new web pages that aid in the navigation of the site. Many times the automatic systems will have safeguards against removing important information or navigation items. These safe guards could be as simple as not allowing anything to be removed to a required approval by the webmaster [13].

Since a website can experience a large number of users, a system that adapts the web pages on a per user basis will generally always modify the pages automatically with little or no interaction from a webmaster. Often times these adaptive web sites will use collaborative filtering techniques to make their decisions. This is because many people will navigate the website in similar ways, and the user can benefit from the navigation of the website from other users before them. [14]

One example of adaptive web sites is the WEBSA [14] project. This system did not actually modify the web page, but it did add a navigation item to the bottom by adding a new frame to the website. This special frame showed a list of possible places to go based upon the user's previous visits. If the user had not previously visited the web site, it would generate a list based on all of the past users. The list would contain links that are at most 5 clicks away from the current page, and at minimum 2 clicks. This was done to make the problem manageable, and avoid repetitive links since anything one link away would have a link on the page being viewed. This list would be generated off of a database that was updated by a nightly program. Most of the processing was done in this nightly program, and hence the system could deal with large amounts of data.

As you can see, adaptive web sites can be of great usefulness. However the effectiveness of them to make more than trivial changes has not yet been proven [13].


5.4 Task Customization


Most carpenters use the same hammer to create a bookshelf as to create a desk. This is because they like the tool, and it suits them. The user will select the best tool for them, and this is the idea in task customization, to allow the user to use the same tool in multiple places. For instance, if a user has a really great bookmarking tool, wouldn't it be nice if they could use that instead of the bookmarking tool that comes with their web browser? Task customization on a website may include allowing the user to use some sort of standardized search utility. This standardization would require that many different web sites got together and made a protocol for each of these standard tools. Unfortunately, this standardization is currently lacking on the world wide web.


5.5 User Customization


User customization is when a user selects the items they wish to personalize, usually explicitly from a list. This customization will often take the form of selecting colors for a web site, or selecting the type of content that will be shown on the website.

Yahoo uses this approach in their My Yahoo! Page. On this web page, the user can select from a variety of modules to be displayed on their page. These modules can range from top news stories of a particular area, to the picture of the day. Some of these modules can even be further customized, such as a weather module showing only the local weather for the user. Not only which modules, but where they are placed on the web page can be selected. [6]


6. How We Personalized


6.1 Implicit Voting

We choose to use an implicit voting mechanism to record data on the user's preference towards items. We record how long and how many times the user looked at an item, if they bought the item, and how many times the item came up in a search (but didn't necessarily click on the item to get it's details). Since the correlation of the items is dependent on the problem, we had to guess at a reasonable configuration to get an implicit vote.

We used an algorithm as follows:


Where Vi,j is the vote of user I for item J, LookTimei,j is the amount of time the user i looked at item j, LookNumi,j is the number of times the user looked at the item, and Views refers to the number of times the user could have viewed the item (Such as seeing the summary in a search).

This formula gives twenty five percent of the vote weight based on the amount of time the item was viewed, fifty percent of the weight on the number of times viewed, and twenty five percent on the ratio of the number of times the item was viewed to the number of times that it could have been viewed. The LookTime and LookNum are normalized so all users are using the same scale.

Nearly all of the item viewing options go through the search.CGI script, this made it easy to record the actions of users. Every time an item is viewed, the search.CGI script calls a special view_item routine that records that fact in a database table. Each record in the table corresponds to one item / user pair. If the user has already viewed the item, the view item routine updates the information stored in that record, otherwise it creates a new record. It uses the standard user tracking system that the rest of the system uses, and this is yet another way we learn more and more about a user. One problem is recording the amount of time that the user has viewed the item. We simply assume this to be the time between the views of two different items. If this time for a single view is over a certain threshold, we do not record it, as the user may have left the site without viewing another item, walked away from their computer for a few minutes, or other similar scenarios. This is one reason why the amount of time spent is only twenty five percent of the vote.


6.2 Collaborative Filtering


We decided to implement a collaborative filtering mechanism to provide a recommendations page to the user. This recommendation page will be like the searching mechanism, and provide the top N ranked items for the particular user, where N is a user selectable number.

The equation we used to calculate the recommendations is:

Pa,j is the preference for the active user (a) for item j. N represents the number of users. The function w(a,i) returns a weight which represents the similarity between two users. Vi,j is the vote for user i on item j.is the mean vote of the user i as shown in the second equation, where Ii is the set of all the votes the user has made. K is a normalizing factor of all of the weights, w(a,i) such that the absolute values sum to unity. [16]

Much of the research that has gone into collaborative filtering is based on finding an accurate and efficient algorithm for the weight. In our project we used a weight as follow:

Va,j is the vote for user a on item j and Ix is the set of items the user has


voted on [16]. In the actual implementation of the algorithm, it was noted that theterms could easily be calculated once for each user, and this optimization


was used. Furthermore, the weight between user a and user i is equal to the weight between user i and user a, so the values are cached assuming there is enough memory to do so.

The squared terms in the denominators are used to normalize the votes so that users that simply have more votes won't necessarily be more similar to other users. There are several other techniques for normalizing the votes, including the number of votes and the absolute sum of the votes. [16]

The maximum time required to calculate the preference for a user grows linearly by the number of items multiplied by the number of users. The time required to compute the preferences for the entire user base will therefore grow in exponential time. At some point the processing will need to be broken down into smaller units. Most of the calculations for the collaborative filtering engine are done in a nightly batch program. This program recalculates the recommendations based upon the activity for the day. This provides a scalable approach since subsets of users can be calculated on different physical machines, and the the results combined. An even further optimization can include the caching of various sub-parts of the equation from previous days for unchanged users.

In the future, if the collaborative filtering system is proven to be somewhat accurate, it may also be used to generate the home page for returning customers. This accuracy can be computed by selecting a user, artificially blanking out one or more votes they have for various items, recalculate the expected recommendation, and then comparing the calculated to the actual.

In order to test the collaborative filtering, we created 5 fictitious users, and had them browse the test items in different manners. The tests showed that the algorithms worked the way they were written. The results for those users can be found in Appendix C. Most [16, 10, 14] research on collaborative filtering seems to indicate that the accuracy of such systems improves as the number of users increases. In the future, we hope to grow our user base large enough to test this.


6.3 User Customization


We have done several things in the area of user customization. These personalization items are driven by the user tracking system we have described in other parts of this paper.

The quick category listing on the web site is made up of the 5 categories the user has visited the most. This listing is available on the left of nearly every web page the customer can view. By giving this quick listing, the user does not need to visit the full category listing each time they wish to search for items by category, thus reducing the path to the items by one click. If the user has never visited any of the categories, the quick list is simply made up of a default list by the system administrators.

We also employed a couple of layout modifications. If in the past the user has bought an item, we provide a customer login on the left of most web pages. When a customer logs on, they are brought to the customer home page that allows them to check on order status, view their favorites list, and alter their personal information. Likewise, if the user has been known to log in as a vendor, a vendor login is available that brings the user to the vendor home page.


7. Real World Issues


We actually implemented this system and put it online. This implementation and deployment included many things beyond the development of the software.


7.1 Domain Name


The first real world issue we had to deal with was the selection of a domain name. This took us several hours because there are so many dot com domain names already registered. Register.com [8] alone boasts over two million registered domain names. Originally we were hoping something that would convey the idea of a mall, or a collection of vendors. We tried everything from online-bazzar.com to internetyardsale.com with no luck. Unfortunately, the night wore on, we got tired, and started to try just anything. For fun we tried shoppingcow, and it was available. Nearly instantly we both saw the marketing potential of a cartoon cow pushing a shopping cart (Appendix A, Figure 2). Within minutes, the domain name was registered, and in a matter of hours the change had propagated across the internet DNS servers.


7.2 Web Server


After having a name, we needed someplace to put the website. We looked at dedicated hosting companies, but they were out of our price range. We then looked at virtual hosting companies, and selected web2010 [9]. Virtual hosting is where one physical server is split up into many logical web servers. When you access the web site from one of these virtual servers from a web browser, it appears as a standalone web server.


7.3 Taxes


In order to run a legitimate business, we needed to register with the IRS to get a tax payer number. To our surprise, it was one of the easiest parts of the process, and we merely needed to make a phone call. It seems the IRS is all too happy to get new tax payers.


7.4 Encryption


It is nearly impossible to do business on the internet if the customers do not believe that their personal information will be safe. Due to this fact, we needed to use encryption for the order verification phase of the order. To do this, we purchased an SSL certificate. To get this certificate, we needed to fax over information, digitally sign an email, and then wait for the key to come back to us to install on the web server.

Once the key was installed into the web server, an encrypted version of any of the web pages on the site could be accessed via an https request instead of the usual http request. This made it painless to integrate encryption into the site.

One other place that encryption was used was in the passwords. A one way encryption algorithm was used to scramble any passwords coming into the system. This encrypted value was then stored. Later, when someone wanted to log in, we would simply encrypt the password they gave and compare it to the encrypted version stored in the database.


7.5 Credit Cards


We then applied for a merchant account for credit card processing. It was unfortunate that we waited this long to do this, because it wasn't until this point that we learned about something called "credit card factoring". This is where one company accepts credit card payments for another. Unfortunately, it is greatly frowned upon by the credit card industry, and we needed to change our business model. We were forced to change to a consignment based business model. Instead of the vendor sending the item, we needed to sell it and send it to the customer ourselves, thus satisfying the restraint that any transaction must be between us and the customer, with no middleman.


8. Conclusion


As in nearly every software engineering project, there is room for improvement in this project. First, scalability issues need to be addressed. Currently the system is a two tier system, with the CGI scripts generating the web pages, and performing the business logic as the database stores the data. It would be nice to separate the web page generation and business logic. Once this is done, the business logic could be placed on separate machines to reduce the load on the web servers.

Giving each vendor a personal page, perhaps as a sub domain of shoppingcow.com, could be used as an incentive to attract more vendors. The vendors would then have their items listed on their own personal page that better represents them, and on the main shoppingcow web site.

Many vendors already have their own web site, and it would be nice if they could integrate the shoppingcow shopping cart into their site. For instance, next to the description of each item on their site they could have a "Add to shoppingcow cart" link that would add that item to a shoppingcow.com shopping cart.

Customer relations will need to get better. The comment system may not always be adequate. A system that can track a single problem with all of the comments, dates, times, and reference orders would be preferable.

Once the site has grown to a sufficient size, it will be necessary to start extracting market research data to determine the best course for the business to follow. An easy way to get at this data will be required.

Once we have that market research data, a better algorithm for calculating the implicit votes from the users' actions should be created.


References


[1] Craig Larman. Applying UML and Patterns. Prentice Hall PTR, 1998. New Jersey


[2] Netcraft - http://www.netcraft.com


[3] MySql - http://www.mysql.org


[4] Budi Yuwono, Savio L. Y. Larn, Jerry H. Ying, Dik L. Lee. A World Wide Web

Resource Discovery System. Fourth Internat. WWW Conference, Boston, Dec. 1995.


[5] Joesph Kramer, Sunil Noronha, John Vergo. A User-Centered Design Approach to Personalization. Communications of the ACM, 43(8),

August 2000


[6] Udi Manber, Ash Patel, John Robinson. Experience with Personalization on

Yahoo! Communications of the ACM, 43(8), August 2000


[7] John McCarthy. Phenomenal Data Mining. Communications of the ACM,

43(8), August 2000


[8] Register.com - http://www.register.com


[9] Web2010 - http://www.web2010.com


[10] Upendra Shardanand, Pattie Maes. Social Information Filtering: Algorithms for

Automating "Word Of Mouth". MIT Media Lab, ©ACM, 1995.


[11] Ringo - To try out Ringo, email ringo@media.mit.edu with body of message: help


[12] David M Nichols. Implicit Rating and Filtering, Computing Department, Lancaster University, Lancaster LA1 4YR, UK


[13] Mike Perkowitz , Oren Etzioni. Adaptive Web Sites. Communications of the

ACM, 43(8), August 2000


[14] Isabel F. Cruz, Lijun Leo Liu, Tony Wu. WEBSA: Database Support for

Efficient Web Site Navigation ADVIS Research Group. Computer Science

Department. Worcester Polytechnic Institute


[15] Doug Riecken. Personal End-User Tools. Communications of the ACM,

43(8), August 2000


[16] John S. Breese, David Heckerman, Carl Kadie. Emperical Anlysis of Predictive

Algorithms for Collaborative Filtering. Microsoft Research - Redmond WA

98052-6399 - Appeared in proceedings of the fourteenth conference on

uncertainty in artificial intelligence, Madison, WI, July 1998. Morgan

Kaufmann Publisher


[17] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-

Wesley Publishing Company, Reading, MA, 1999





Appendix A



Insert Graph, Remove Blank Page!


Figure 2





Page 9