Jebeh Kawah, Faisal Razzaq, Enzhou Wang
Mentor: Shui-Lung Chuang
CS 511 - Spring 2005
Project #7 Data Record Extraction
Current web data record extraction methods [1, 2, 3] are mostly based on HTML syntax analysis. Algorithms and heuristics are developed to identify HTML tag patterns. The boundary of a data record is determined by tag patterns. Such an approach makes minimal assumptions about the content of web pages to be processed and is applicable to a wide range of data extracting tasks. However, in many web data record extraction scenarios, we have more information we can utilize than just the HTML syntax structure. For example, when we search a keyword at www.amazon.com, we can be sure that some of the returned data records will contain the keyword. When we aggregate information from multiple sites, we can search the same keyword on multiple sites. The correlation between the search results from multiple sites is able to help identify data records. This is because the repetition of the data records among different sites establishes a pattern that can be used to obtain information about other similar data records. Based on these observations, we propose a new algorithm CCM (Content-Based Cross-Site Mining Web Data Records) that takes advantage of cross-site reference information to tackle the data record extraction problem.
When we search certain keyword at a single site, we know for sure that the result data records must contain the search keyword, but we are not so certain about whether a string that contains the search keyword is part of a data record. However, if we search the same keywords at multiple sites and a certain string containing the search keyword appears in search results of several sites, it is very likely that string is part of a data record. CCM is based on such an observation that there can be data records that are easily identifiable when doing cross-site keyword search. We will identify data region and record boundary based on those records and conduct further extraction. Assume we search a keyword on multiple sites. A keyword can be a word, a phase, or some combination. We will get one or more result page from each site. The input of this algorithm is a list of pages grouped by sites. The algorithm we propose has four major steps:
By key phrase, we mean a content phrase in HTML that contains the keyword we search for. For example, when we search “Java” on www.amazon.com, we may get a simplified result page like:
<tr><td>Java 2: A Beginner's Guide</td></tr>
<tr><td>Head First Java</td></tr>
The key phrases will be “Java 2: A Beginner's Guide”, “Core Java”, and “Head First Java.” We will extract a key phrase list for each site we searched. For example, by searching “Java” on www.bn.com, we get
<tr><td>Java in Easy Steps</td></tr>
<tr><td>Head First Java</td></tr>
The key phrase list is “Java (SparkCharts)”, “Java in Easy Steps”, and “Head First Java.” From www.bookpool.com, we can get
<tr><td>Head First Java</td></tr>
Once we have key phrase list for each site, we will combine the lists to generate a ranked key phrase list. A key phrase will be ranked by how many times it appears in individual key phrase list. In our three-site search example, the ranked key phrase list is
|Head First Java||3|
|Java 2: A Beginner's Guide||1|
|Java in Easy Steps||1|
We will start processing individual pages starting from the page that has the highest ranking score: the summation of all key-phrase rankings of the page. We can randomly pick one if there is a tie. In our example, we will start from the page from www.amazon.com.
We call the highest ranked key phrase “seed.” In our example, the seed is “Head First Java.” We assume that the seed we choose is part of a data record we are searching for. We will try to identify data record boundary by growing this seed. The method used to "grow" the seed is essential to CCM algorithm and can have different implementations. We will describe the method we used in Implementation Detail sections.
Once we identify the record boundary, we will apply the data region and record boundary information to all the result pages of this site. For each page, we will keep track of how many key phrases for that page have been identified. If the number of phrases not being extracted is above certain threshold, it means we may have missed an additional data region. We will use that key phrase as seed and repeat step 3 to help extract records from those missed data regions.
(3) and (4) will be repeated for all sites.
We choose PERL initially thinking CCM algorithm involves string comparison. PERL's built-in regular expression facilities will be helpful. It turns out PERL is the right choice for us, but for a different reason. PERL has mature packages that create and manipulate tree structure based on HTML document. Working on the tree representation proves to be very effective.
We used two packages: HTML::TreeBuilder to create the tree; HTML::Element to manipulate each element. The tree created by HTML::TreeBuilder is composed of HTML::Element or string. Besides the operations that are common to a tree data structure. An important attribute of a HTML::Element is its address. Address of an element is the string representation of the location of that element. For example, an element may have address: 0.2.10.8. An element can also detach from its tree. Both address attribute and detach operation help to simplify our implementation.
Getting the seed is straightforward, but it is challenging to grow the seed to find data region and boundary tags. Based on the observation that we can identify data region and record boundary if we know two neighboring data record. Our goal is to use seed information to find two neighboring data record nodes.
Suppose we know the seed phrase, we simply pick its immediate neighboring phrase as neighbor. To simplify, we always call the left one as seed, the right one as neighbor. We keep traversing up the lineage of seed and its neighbor until common ancestor is reached. Common ancestor node will be the data region node. One level under data region node is what we call level1 node, which will be the basis for us to divide data boundaries. For simple cases, one level1 tag will be a boundary, for example, a <tr> can be a boundary tag. We know a case is simple if seed_level1's immediate right node is neighbor_level1. If there are more than one tag separating data records, we will first find the level1 node of the first and last phrase node of a data region. Since we have the data region node, we go through each of its child to find the first_level1 and last_level1 node. We will start with first_level1 and seed_level1 (seed node is always the left one compared to neighbor node). We move left node by node, as long as both nodes have the same child structure, we consider that node part of the boundary tag. We simply use the number of child tags to test about child structure similarity. Similarly we can use neighbor_level1 and last_level1 and go right to find the right part of boundary tag. We will get the separating tag pattern by combining the left part, tag of seed node, and the right part.
With region address and boundary tags ( we call this combined information region_info), we can start to extract the page. It becomes a case of pattern matching. When we find a candidate matching record, we need to check if that candidate is indeed a record. For example, in the simple case, we have boundary tag as <tr>, but there could be some <tr> at the end about customer support information that also has the search keyword. To check a candidate, we keep seed_relative_address, which is the relative address of the phrase node to the region node. For example, if 1.0.2 is the region address and 184.108.40.206.3 is the phrase node address, 0.3 will be the relative address. With each candidate level1 node we get, we try to get a phrase node using the relative address. If we succeed, that level1 node is a record node. In the original implementation, we tried to find the keyword in each candidate node, but it turns out a record may not necessarily have the keyword, for example, a record can have J2EE, but not Java.
To simplify extracting multiple data regions, we detach the data region node from the original tree and run the extraction again. The extraction stops until we have extracted certain percent of phrases or we are making no progress.
Originally we used regular expression to parse HTML text to get the ranked phrase list. Later we found out that the text we get from HTML::Element has done the special character conversion, for example, & is converted to &. This causes problem to use ranked phrase list to find the seed node. In the current implementation, we obtain phrase list from the tree representation to solve this problem.
A problem of using HTML::Element is that the children of a HTML::Element can either be HTML::Element or just a string. During the traversal process, we will need to check whether a node is a HTML::Element or not.
The extraction result is a list of HTML::Element. It provides basis for further data field extraction. In our test, we only display the text in the elements. After further extraction, delete method needs to be called on all the HTML::Element to avoid memory leak.
To run the test on custom data, simply run CCMTest.pl --keyword=keyword --datadir=datadir, for example CCMTest.pl --keyword=Java --datadir='..\data\book\java'.
Our initial test is on the five book sites we listed in proposal: www.amazon.com; www.bn.com; www.bookpool.com; www.half.com; www.powells.com. The keyword we used is "Java". The first one we have working is www.bn.com, which turns out to be the simpler one. Most other sites presents their unique challenges. www.amazon.com has two data region on its search result page: one lists the most popular titles, the other lists the actual search results. www.bookpool.com has data record that does not contain the keyword: having J2EE, but not Java. www.powells.com lists the titles using alphabetic order while most other sites order titles by popularity. We improved our implentation by making it work on each site. All the improvement is generic and not specific to one site. The result is surprisingly good: we are able to extract data records from these five sites with 100% precision and recall. After the algorithm worked on the five sites, we tried another three sites to evaluate the robustness of our algorithm: www.abebooks.com/; www.bartleby.com; books.ebay.com. Without any modification to the code, the extraction succeeded on these three sites with 100% precision and recall. We also tested the CCM algorithm on a different domain: sites selling electronics. The sites we tested are: www.bestbuy.com ; www.circuitcity.com ; www.radioshack.com . We search 'Camcorder' on those site. The results are less encouraging. The extraction is successful with 100% precision and recall at www.radioshack.com, but we failed to extract data records from the other two sites. However, we feel the failure has more to do with the robustness of our implementation than the basic idea of CCM.
Due to time limit, we did not do a profiling to evaluate performance of our implementation.
Based on the intial testing results, Cross-site Content Based Algorithm provides a promising alternative to syntax based data extraction algorithms. It is less likely to be affected by presentation style changes of web pages. The effectiveness of extraction will improve as it is used on increased number of web pages. The data records we have extracted can be fed back into ranked phrase list to help further extraction. The implementation needs to be tested on more sites and pages to improve its robustness, but our content-based approach to data extraction has shown to be a direction worth further exploring.
1. D. Buttler, L. Liu and C. Pu. A fully automated extraction system for World Wide Web. In Proceedings of the 2001 International Conference on Distrubuted Computing Systems, 2001.
2. D. W. Embley and Y. Jiang and Y.-K. Ng. Record-boundary discovery in Web documents. In Proceedings of SIGMOD, 1999.
3. B. Liu, R. Grossman, and Y. Zhai. Mining data records in Web pages. In Proc. of the ACM SIGKDD, 2003.