Trabalho Pi Web Scrapper


8/2/2019 Trabalho Pi Web Scrapper

WebScrapper: A multiagent system for web crawling

Neves, João

Departamento de Informática

    Universidade do Minho

    Braga, Portugal

    [email protected]

Abstract: The present work demonstrates the use of a cooperative multiagent system for recovering properly formatted information from auction and sales portals. The JADE (Java Agent DEvelopment Framework) [1] toolkit was used to accomplish this task. We chose a multiagent architecture because the different tasks involved in this process can be executed independently by simple, task-oriented agents in a cooperative environment.

Keywords: multiagent systems; web crawling; data warehouses; JADE; web portals.

I. INTRODUCTION

In today's internet the proliferation of auction and sales portals has increased the offer to the general public, at times eliminating the intermediate entities that had previously been the primary interface between the manufacturers of goods and the consumer in the distribution chain. This has created a mixed environment where private individuals compete with companies selling the same kind of items. This has particular importance in the business of used items: cars, houses, electronic equipment, etc. We propose a model that allows the creation of a system, in the form of a prototype, to collect properly formatted data from online auction and sales portals, transforming the extracted data into usable and useful information. The system can act as middleware, functioning as a gateway for automatic retrieval functions. The collected information can later be used to feed other systems such as data warehouses, portals, search engines, etc.

II. STRUCTURE OF WEBSCRAPPER

A. Motivation

The project described has some similarities with shopbots [3] as defined by Fasli. However, shopbots have a broader usage and our objectives were considerably simpler. Our purpose was to scrape from the HTML returned in web pages the relevant information according to very precise criteria. We chose a multiagent architecture for the following reasons: to decouple the different tasks involved in the collection of data from online resources; to create simple, task-oriented autonomous agents that together make it possible to construct a unified system capable of accomplishing the fairly complex task of extracting data from online selling systems; and to be able to scale the system so as to increase the number of entities collecting data according to the needs of the application. The JADE toolkit was chosen because it offers a ready-made platform to develop on top of and is compliant with FIPA standards. It is funded by large corporations and based on an open-source philosophy. This was a research project that might evolve into a final application, so the costs of investment in software were also accounted for in the decision process.

B. WebScrapper system

    Figure 1.

The system is divided into three types of agents: a master, a crawler, and a specialized extractor that is constructed according to the site that is subject to data scraping.

The master and crawler cooperate on the job of scheduling the fetching of data from a presented URL. They were made separate so as to allow several fetching agents (crawlers) and to be able to make them anonymous if necessary. This is sometimes a requirement, depending on the amount of data to be extracted and on how paranoid the external sites are about automatic, non-human browsing; based on this they can block access. The master agent (we are assuming that only one is necessary) acts as a maestro, responsible for distributing URLs to be fetched by the crawler community.

Datinfor was the main sponsor of this project: http://www.datinfor.com


The extractor agents work on the retrieved data and, according to their definitions, create the queries to remote sites, parse the retrieved contents, and may create additional URLs to be retrieved from the web sites. They work in stages that we will elaborate on later. To glue everything together, we chose to save the collected data to database tables. We also use the same database to store some auxiliary tables that are used to control and manage the communication between the agents.

In the prototype the crawler agents do not need to access the database. The master and crawler exchange messages in a FIPA-compliant protocol. The information exchanged between them is a URL/HTML pair for each request: the URL to be fetched is transmitted to the crawler by the master, and the crawler replies with the fetched data in the form returned by the crawled site. This design choice was made so that the agents could be distributed across many sites with minimal requirements regarding the communication infrastructure. The system should be able to use existing security systems (firewalls, proxies, etc.). The prototype was conceived to work in batch mode and to have its final data stored in a data warehouse.

The interfaces of most online auction and sales sites differ from each other significantly, and only very few of them offer methods for machine-to-machine integration. In order to be able to integrate these sites we created the extractor agent. The most intricate and complex programming job of the whole system was the extractor agent's code. It is also the only agent that, because of its fairly complex structure, denotes some perceived intelligence. Initially our plan was to have a general extractor that would process regular expressions on retrieved data, with the final result being the extracted information. A small prototype was created to try to accomplish this. After initial hands-on experience, a decision was made to move away from it, because the necessary structure was too complex to maintain and follow, and that collided with our intention to design a simple system. Also, the restriction to use only regular expressions resulted in a great loss of the richness offered by a complete language like Java, when compared with regular-expression scripts for data extraction. Our final decision was to create an agent for each site that would be responsible for the data collection and extraction activities.

In the analysis, three stages were identified that need to be processed when retrieving data from online shopping sites: query construction, list processing, and detail processing. The first stage, query construction, is necessary to identify the input variables and syntax of the site in order to construct the proper URL for data retrieval; it is also this stage that initiates the data collection activities of the whole system. For the second stage, list processing, the agent has to be able to extract the relevant links that give access to the details of the retrieved records, as well as the pagination links in case there is a need to extract all pages from the site (relevant for the original query). The last stage, detail processing, extracts the final data from the site, which in our prototype is added to a data warehouse. In certain situations the last step can be skipped, when the list returned from the site already has enough information to fulfill the requirements of the retrieval task; in these cases the last two stages merge and function as one.
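The three stages above can be sketched as a per-site extractor interface. This is a minimal illustration, not the prototype's actual code: the `SiteExtractor` and `DemoExtractor` names, the example site URL, and the naive link scan are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contract implemented once per target site.
interface SiteExtractor {
    String buildQueryUrl(String terms);             // stage 1: query construction
    List<String> extractListLinks(String html);     // stage 2: list processing
    Map<String, String> extractDetail(String html); // stage 3: detail processing
}

// Toy implementation for an imaginary auction site.
class DemoExtractor implements SiteExtractor {
    public String buildQueryUrl(String terms) {
        // Encode the search terms into the site's query syntax.
        return "http://auction.example.com/search?q=" + terms.replace(" ", "+");
    }

    public List<String> extractListLinks(String html) {
        // Naive scan for href targets; a real extractor would also
        // pick out the pagination links here.
        List<String> links = new ArrayList<>();
        int i = 0;
        while ((i = html.indexOf("href=\"", i)) != -1) {
            int end = html.indexOf('"', i + 6);
            links.add(html.substring(i + 6, end));
            i = end;
        }
        return links;
    }

    public Map<String, String> extractDetail(String html) {
        // Site-specific field parsing would go here.
        return new LinkedHashMap<>();
    }
}
```

When a site's list page already carries all required fields, stage 2 and stage 3 would collapse into one method, as described above.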

The communication between the master (the one agent that supervises the actual external data extraction) and the extractors is done via a state table. In this table we maintain the type of the extractor record (to distinguish between different collecting sites), a URL, the actual retrieved data, the agents that executed the jobs (extraction crawler, processing extractor), the extraction stage, and several flag indicators. New entries are always created by extractor agents and are updated in turn by the master and extractors as the request goes through the processing phases.
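One possible shape for a row of that state table is sketched below; the field and method names are assumptions for illustration, not the prototype's actual schema.

```java
// Illustrative in-memory model of one row of the requests (state) table.
class Request {
    String extractorType;  // distinguishes the different collecting sites
    String url;            // URL to be fetched (or already fetched)
    String retrievedData;  // raw content returned by the crawler
    String crawlerAgent;   // agent that fetched the URL
    String extractorAgent; // agent that processes the result
    int stage;             // 1 = query, 2 = list, 3 = detail
    boolean toCrawl;       // raised by an extractor, cleared by the master
    boolean processed;     // raised when the extractor has finished this entry

    Request(String extractorType, String url) {
        this.extractorType = extractorType;
        this.url = url;
        this.toCrawl = true; // new entries always start as pending fetches
        this.stage = 1;
    }

    // Called on the master's behalf once a crawler has returned the data.
    void markFetched(String crawler, String data) {
        this.crawlerAgent = crawler;
        this.retrievedData = data;
        this.toCrawl = false;
    }
}
```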

The communication between master and crawler is done using standard FIPA-ACL performatives. A PROPOSE is sent to all crawler agents with a URL. An INFORM is sent from a crawler informing the master that it will fetch the URL. The master sends a CONFIRM to that crawler, acknowledging that it accepts the job, and the crawler then replies with the retrieved data and the performative AGREE.
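The handshake above can be traced with a minimal stand-in for a FIPA-ACL message. In the actual prototype this would use JADE's `jade.lang.acl.ACLMessage`; the `Msg` and `Handshake` classes here are self-contained assumptions that only model the performative sequence.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for an ACL message: just a performative and a payload.
class Msg {
    final String performative; // PROPOSE, INFORM, CONFIRM, AGREE
    final String content;      // the URL, or the fetched data
    Msg(String performative, String content) {
        this.performative = performative;
        this.content = content;
    }
}

class Handshake {
    // Returns the message trace of one successful master/crawler exchange.
    static List<Msg> assign(String url, String fetchedData) {
        List<Msg> trace = new ArrayList<>();
        trace.add(new Msg("PROPOSE", url));       // master -> all crawlers
        trace.add(new Msg("INFORM", url));        // crawler: "I will fetch it"
        trace.add(new Msg("CONFIRM", url));       // master assigns the job
        trace.add(new Msg("AGREE", fetchedData)); // crawler returns the data
        return trace;
    }
}
```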

One issue that we needed to account for in this project was the coordination of the communication between the different agents acting on the environment. We will treat this item in two parts: coordination between master and crawlers, and coordination between master and extractors. For the first relationship, master/crawler, the master was defined as the entity responsible for the distribution of work among crawlers. The communication protocol described above takes care of assigning a task to only one agent at a time. The master agent collects jobs from the requests table that are flagged to be crawled and hands them to the community of registered crawlers. When the task of fetching data is finished, the master writes the results in the requests table and updates a flag accordingly.
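One pass of that master scheduling loop could look like the sketch below, assuming requests are represented as simple key/value maps and crawlers sit in a pool; the `Crawler` interface and the map keys are illustrative, not the prototype's actual design.

```java
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Stand-in for a crawler agent; in the prototype this is a separate
// JADE agent reached via the FIPA-ACL handshake, not a local call.
interface Crawler {
    String fetch(String url);
}

class Master {
    // One pass: hand every request flagged for crawling to a crawler,
    // round-robin over the pool, then store the result and clear the flag.
    static void dispatch(List<Map<String, Object>> requests, Queue<Crawler> crawlers) {
        for (Map<String, Object> req : requests) {
            if (Boolean.TRUE.equals(req.get("toCrawl"))) {
                Crawler c = crawlers.poll();
                req.put("retrievedData", c.fetch((String) req.get("url")));
                req.put("toCrawl", false);
                crawlers.add(c); // return the crawler to the end of the pool
            }
        }
    }
}
```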

The relationship and interaction between the extractors and the master is as follows: when an extractor needs data from the internet, it creates an entry in the requests table indicating the URL it wants extracted, with a raised flag for crawling processing.

The interaction of the extractor with the requests table is a bit more complex, for it is this table that controls the course of action to be followed by the extractor agent. It takes two roles: one that creates new requests in the table, and another that analyses the requests and acts according to the stage each request is in. It is also responsible for updating a flag that indicates the request has been processed. As the tasks are performed, the request is continuously updated by the executing agent.

The input to the system in this prototype is done through a user window that initiates the query and starts the whole system interaction. The output of the system is placed in a database table that precisely defines the collected data according to a pre-defined data structure.

III. CONCLUSIONS

With the present work it was possible to create a simple and functioning system that permits the extraction of data from external auction and online sales sites. The multiagent architecture allows for an easy process of scaling as the number of requests grows. The designed system does not provide a final solution per se, but allows itself to be used as middleware for a wide range of solutions: federated search systems, data collection crawlers, online coaching systems, etc.

IV. FUTURE WORK

An ontology will be defined for data extraction that will create a standard method to define query expressions according to a specific domain of knowledge. We will also explore the Web Services interface provided by the JADE platform, so as to create a standardized interface for online services. This interface could then be used as a transducer [4] to online auction and sales sites, exposing the retrieved information to general search engines using standard search protocols (SRU, Z39.50, OpenSearch).

    REFERENCES

[1] Java Agent DEvelopment Framework. http://jade.cselt.it
[2] F. Bellifemine, G. Caire, and D. Greenwood, "Developing Multi-Agent Systems with JADE". Wiley: Sussex, 2007, pp. 19-20.
[3] M. Fasli, "Shopbots: A Syntactic Present, A Semantic Future", IEEE Internet Computing, 10:6 (IEEE Press), 2006.
[4] M. Nikraz, G. Caire, and P. A. Bahri, "A Methodology for the Analysis and Design of Multi-Agent Systems Using JADE". Murdoch University: Rockingham, 2006, pp. 10-11.
[5] M. Wooldridge, "An Introduction to Multiagent Systems", John Wiley & Sons, ISBN 0-471-49691-X, 2002.