联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp

您当前位置:首页 >> Database作业Database作业

日期:2020-10-28 10:39

Introduction

you are asked to answer several questions to demonstrate your level of understanding

on multiple topics, including distributed database, data warehousing, data integration

and data quality management. Meanwhile, coding is required for some questions to

show your problem-solving ability.

Tips & Suggestions:

1. It is highly suggested to complete Prac 3 before working on the coding part of

this assignment (Part 4). Although the assignment is independent to pracs, the

code introduced in Prac 3 can be the starting point of this assignment as they

work on similar tasks.

2. Implement your code in SQL, Java or Python, choose the one you feel

comfortable with and stick to it till the end of the assignment. The code must

contain basic comments so that tutors are able to understand the structure of

your code and the objective of each snippet.

WARNING: The reuse of code from practicals are allowed.

Preliminary: Dataset Description

In this assignment, we have four datasets about book information from four different

sources. The data schemas are listed below:

Book1

(id,title,authors,pubyear,pubmonth,pubday,edition,publisher,isbn13,language,series,pages)

Book2

(id,book_title,authors,publication_year,publication_month,publication_day,edition,

publisher_name, isbn13,language,series,pages)

Book3

(ID,Title,Author1,Author2,Author3,Publisher,ISBN13,Date,Pages,ProductDimensions,SalesR

ank,RatingsCount,RatingValue,PaperbackPrice,HardcoverPrice,EbookPrice,AudiobookPrice)

Book4

(ID,Title,UsedPrice,NewPrice,Author,ISBN10,ISBN13,Publisher,Publication_Date,Pages,Di

mensions)

Part 1: Data Management and Query [6 marks]

Read the above schemas carefully and understand the meaning of each attribute. If you fail to

understand some of them, check the data under it or Google its meaning (especially for some

abbreviations, like ISBN). Answer the following questions based on your understanding.

Question 1: [2 marks] Given that four datasets are stored in one relational database

as separate relations. For a query “Find top 100 books that have the best sales, return

their ranks (sorted in ascending order), titles, publishers and number of pages.”, which

schema(s) can answer such query? Write down the corresponding SQL query.

Question 2: Given that Book2 is stored in a distributed database A, and two queries

that are most frequently asked on A are:

? Find all books whose publisher name is “XXX” (or among multiple

publishers), return their book titles and author info.

? Find all books that are published in a given year, return their book IDs,

languages and number of pages.

Answer the following questions:

(1) [2 marks] Assuming that we horizontally fragment the table into three fragments

based on the following predicate:

Fragment 1: pages ≤ 100

Fragment 2: 100 ≤ pages ≤ 800

Fragment 3: pages ≥ 600

Is this predicate set valid? If so, please explain the insert process if we want to insert a

new record into Book2 (using plain English). If not, please generate a valid predicate

set using min-term predicate (show the calculation process). Also, explain the insert

process for a new record after the valid predicate set is made.

Part 2: Data Integration and Quality Management [12

marks]

Given that the data warehouse loads data from the above four sources (Book 1,2,3,4),

you are asked to integrate their data and address various data quality issues. The

actual data of book lists are given as CSV files, namely “Book1.csv”, “Book2.csv”,

“Book3.csv” and “Book4.csv”. Note that in a CSV file, the attributes are separated by

comma (,). If two commas appear consecutively, it means the value in the

corresponding field between two commas is NULL. Furthermore, if an attribute field

contains comma naturally, the field will be enclosed by a double quote ("") to

distinguish the commas inside the attribute with the outside comma separator. For

example, a record in Book2 is as follows:

1725,Informix Unleashed,"John McNally, Jose Fortuny, Jim Prajesh, Glenn Miller",

97,6,28,1,Sams,9.78E+12,,Unleashed Series,1195

According to Book 2 schema, we can infer the following fields:

id=1725,

book_title=Informix Unleashed,

authors= John McNally, Jose Fortuny, Jim Prajesh, Glenn Miller, ...

isbn13=9.78E+12

language=NULL,

series=Unleashed Series,

pages=1195.

Here, since there are commas in the “authors” field, the whole field is enclosed by a

double quote. Also, since there are two consecutive commas before “Unleashed

Series”, it means that the language is NULL.

In this part, you are asked to answer the following questions through programming (if

“code required” is specified). Your answers to the questions must be based on code

results. Please save all the code you wrote. When answering a question, please specify

the location of the corresponding code for that question or name your file as

“Question5”, “Question6”, to direct the correct file.

Question 3: As the book list schemas provided in Preliminary, design a global

conceptual schema which combines the common attributes among all four schemas.

Your design should include every piece of information that four schemas share in

common. In other words, if a column can be found or derived from every schema, it

must be included in your global conceptual schema.

(1) [2 marks] Write down the global conceptual schema. The format should be

similar to the schemas in Preliminary.

(2) [3 marks] Integrate “Book3.csv” and “Book4.csv” data according to the global

schema you defined (code required). The data should be sorted by ISBN13 in

ascending order. You should use either of the two approaches mentioned below.

a. If you perform the integration on Oracle database, please create a table named

“FullBookList” using your schema and insert data into it. Take a screenshot of your

table schema from SQL Developer, and another screenshot of first 20 records in the

table. Both screenshots should include your student ID as database username. Add the

screenshots to your solution document, and include your SQL scripts as a text file in

your final submission.

b. If you perform the integration using Java/Python, write the integrated dataset to a

CSV file named as “FullBookList.csv”. Include this file in your final submission.

Take a screenshot of your CSV file that shows first 20 records and add the screenshot

to your solution document.

Normally, we would expect to have various data quality issues in an integrated

dataset. For example, by checking ISBN13 code in “FullBookList”, we can find

multiple pairs of books with the same ISBN13 code, like “9781296126568”,

“9780679887911”, “9781298248848”, etc. As it is very common that the same book

is recorded by multiple sources, it is crucial to identify and merge duplicated records

during the data integration process, which relies on the record linkage technique.

In this regard, question 6 asks you to perform a record linkage task on “Book1.csv”

and “Book2.csv”. We provide a human-labelled gold-standard dataset (refer to Prac 3

Part 2.2 for more information about gold-standard), named as “Book1and2_pair.csv”,

which lists all correct matchings between Book1 and Book2. It will be used in the

following tasks. Its schema is as follows:

Book1and2_pair (Book1_ID, Book2_ID)

Question 6: [4 marks] Perform data linkage on Book1 and Book2 using the methods

mentioned in Prac 3. When linking their results, use Jaccard coefficient with 3-gram

tokenization as the similarity measure and perform the comparison only on the “book

title” field (double quotes that are used to enclose book titles should be removed

before the linkage). Book pairs whose similarity is higher than 0.75 are regarded as

matched pairs. Compare your output with the gold- standard dataset and write down

the precision, recall and F-measure (code required).

Question 7: [3 marks] In addition to the duplication issue, we want to explore other

data quality issues remained in datasets. Create a sample dataset from “Book3.csv”

containing all records whose id is the multiple of 100 (i.e. 100, 200, 300, ...). Among

all samples, how many fields (a field is a cell in the table) containing NULL are

present (here, NULL is recorded as an empty value in this field)? Calculate the Empo

(error per million opportunities) according to your samples (Empo= number of

NULLs / number of fields) (code required). (Hint: you can sample the records

manually to validate the correctness of your code results)


版权所有:留学生编程辅导网 2018 All Rights Reserved 联系方式:QQ:99515681 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。