
10/27/23, 6:33 AM Fall 2023 CS512 Assignment - CS 512 - Illinois Wiki


Fall 2023 CS512 Assignment


Coding Assignment (Distributed Tuesday, Sep. 12, 2023; Due Thursday, Oct. 26, 2023; Extended to Sunday, Oct. 29, 2023)


Before Diving In


All answers must be in PDF format.

This is an individual assignment. You may discuss it on Piazza, including the performance your implementation achieves on the provided dataset, but please do not work together or share code.

Reference libraries or programs can be found online.

You can use C/C++, Java, or Python 2/3 as your programming language. More detailed project organization guidance can be found at the end of the assignment.


Late policy:


10% off for one day (extended deadline: Oct. 30th, 11:59 PM; originally Oct. 27th, 11:59 PM)

20% off for two days (extended deadline: Oct. 31st, 11:59 PM; originally Oct. 28th, 11:59 PM)

40% off for three days (extended deadline: Nov. 1st, 11:59 PM; originally Oct. 29th, 11:59 PM)

A section titled Frequently Asked Questions can be found at the end of the assignment; we will keep updating it based on questions from Piazza and office hours.

Please first read through the entire assignment description before you start.


Problem Description


As we learned in class, traditional topic modeling can suffer from non-informative topics and overlapping semantics between different topics. In response, discriminative topic mining incorporates user guidance in the form of category names and retrieves representative, discriminative phrases during embedding learning. These category-name-guided text embeddings can in turn be used to train a high-quality weakly-supervised classifier.

Specifically, we need to finish five steps as follows.

Step 1: Download the training datasets on news and movies (see the Problem Data section below) and use AutoPhrase to extract high-quality phrases.

Step 2: Write or adopt CatE on the segmented corpus to find representative terms for each category.

Step 3: Perform weakly-supervised text classification with only the class label names or keywords. Test the classifier on both datasets.

Step 4: Investigate the results and propose a way to use prompting of pre-trained language models to improve them. Implement your method and compare it with the one from Step 3.

Step 5: Submit your implementation, results, and a short report to Canvas.


Problem Data


You can find the problem data at this link. After downloading the data, you will find the following files.


Name     Num of documents   Category names        Training Text      Testing Text      Validation Labels
News     120000             news_category.txt     news_train.txt     news_test.txt     first 100 of news_train_labels.txt
Movies   25000              movies_category.txt   movies_train.txt   movies_test.txt   first 100 of movies_train_labels.txt


Step 1: Adopt AutoPhrase to extract high quality phrases


In this step, you will use AutoPhrase to extract high-quality phrases from train.txt of both provided datasets. The extracted phrase list looks like the following (the example here is different from the homework test data):


Score     Phrase


0.9857636285     lung nodule

0.9850002116     presidential election

0.9834895762     wind turbines

0.9834120003     ifip wg

....
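Once AutoPhrase finishes, its ranked list can be loaded and filtered before the next steps. The sketch below assumes the tab-separated "score<TAB>phrase" format shown above; the file name (AutoPhrase.txt) and the 0.5 score threshold are assumptions you may need to adjust for your own run.

```python
# Sketch: load AutoPhrase's ranked phrase list and keep high-quality entries.
# The "score<TAB>phrase" format matches the example above; the default file
# name and the 0.5 threshold are assumptions to tune against your output.

def load_phrases(path="AutoPhrase.txt", min_score=0.5):
    phrases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2:
                continue  # skip malformed lines
            score, phrase = float(parts[0]), parts[1]
            if score >= min_score:
                phrases.append((score, phrase))
    return phrases
```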


Step 2: Compute category name guided embedding on segmented corpus


Use your segmentation model to parse the same corpus; the recommended parameters for segmentation are HIGHLIGHT_MULTI=0.7 and HIGHLIGHT_SINGLE=1.0. An example segmented corpus can be:

<phrase>An overview</phrase> is presented of the use of <phrase>spatial data structures</phrase> in <phrase>spatial databases</phrase>. The focus is on <phrase>hierarchical

data structures</phrase>, including a number of variants of <phrase>quadtrees</phrase>, which <phrase>sort</phrase> the data with respect to the space occupied by it. Such

techniques are known as <phrase>spatial indexing</phrase> methods. <phrase>Hierarchical data structures</phrase> are based on the principle of <phrase>recursive

decomposition</phrase>.
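Before feeding the segmented text into an embedding learner, a common preprocessing trick is to collapse each <phrase>...</phrase> span into a single token so the phrase is embedded as one unit. The underscore-joined, lowercased normalization below is an assumption; match it to whatever tokenization your embedding code expects.

```python
import re

# Sketch: turn AutoPhrase's <phrase>...</phrase> markup into single tokens
# (underscore-joined, lowercased) so an embedding learner such as CatE
# treats each multi-word phrase as one vocabulary item. The normalization
# scheme here is an illustrative choice, not the only valid one.

PHRASE_RE = re.compile(r"<phrase>(.*?)</phrase>")

def merge_phrases(text):
    return PHRASE_RE.sub(
        lambda m: m.group(1).strip().lower().replace(" ", "_"), text
    )
```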

Then you will write your own CatE or refer to an existing implementation to compute the phrase embeddings and perform category-guided phrase mining. You will need to submit each category and its top-10 representative terms in {category_name}_terms.txt.


For example, in technology_terms.txt, the first line should be the category name embedding, and the following 10 lines the category representative term embeddings:


technology 0.720378 -0.312077 0.811608 ... 1.096724


terms_of_usage_privacy_policy_code_ 1.439691 0.508672 -0.958150 ... -1.277346

...
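Producing the 11-line file above is mostly bookkeeping once the embeddings exist. The sketch below ranks terms by plain cosine similarity to the category vector, which is a simplification: CatE's actual selection also enforces distinctiveness across categories, so treat this only as a format illustration.

```python
import math

# Sketch: write {category}_terms.txt in the 11-line format above
# (line 1 = category name embedding, lines 2-11 = top-10 term embeddings).
# Ranking by cosine similarity alone is a simplification of CatE's
# distinctiveness-aware term selection.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def write_terms_file(category, embeddings, k=10):
    cat_vec = embeddings[category]
    ranked = sorted(
        (w for w in embeddings if w != category),
        key=lambda w: cosine(embeddings[w], cat_vec),
        reverse=True,
    )
    with open(f"{category}_terms.txt", "w", encoding="utf-8") as f:
        for word in [category] + ranked[:k]:
            vec = " ".join(f"{x:.6f}" for x in embeddings[word])
            f.write(f"{word} {vec}\n")
```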

10/27/23, 6:33 AM Fall 2023 CS512 Assignment - CS 512 - Illinois Wiki

2/3


Tip: You can concatenate train.txt and test.txt into a larger corpus for phrase mining and category-name-guided embeddings.


Step 3: Document Classification with CatE embeddings  


In this step, you will build a weakly-supervised classifier (e.g., WeSTClass, LOTClass) on top of the term embeddings or topic keywords obtained in the previous steps. The only supervision is the category names provided in the datasets.

For example, for the news dataset, news_category.txt contains the following category names:

politics

sports

business

technology

To help validate your results, we also provide labels for the first 100 documents of both datasets in news_train_labels.txt and movies_train_labels.txt. Feel free to discuss the validation performance you get in this step on Piazza.


Tip: You can try label names or keywords expanded from the CatE embeddings as weak supervision. We suggest you try both ways and report the better one.
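One simple weak-supervision baseline is to represent each document as the average of its known word vectors, represent each category by the average of its seed-term vectors, and take the closest category as the pseudo label. The sketch below illustrates that idea; all names are illustrative, and dedicated systems such as WeSTClass go further by self-training a neural classifier on these pseudo labels.

```python
import math

# Sketch of pseudo-labeling with category seed terms: embed a document as
# the average of its known word vectors, embed each category as the average
# of its seed-term vectors, and pick the closest category by cosine
# similarity. This is a baseline, not a full WeSTClass/LOTClass pipeline.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def average(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def pseudo_label(doc_tokens, word_vecs, category_terms):
    """Return the best-matching category name, or None if no token is known."""
    doc_vecs = [word_vecs[t] for t in doc_tokens if t in word_vecs]
    if not doc_vecs:
        return None
    doc_vec = average(doc_vecs)
    best_cat, best_sim = None, -2.0
    for cat, terms in category_terms.items():
        cat_vecs = [word_vecs[t] for t in terms if t in word_vecs]
        if not cat_vecs:
            continue
        sim = cosine(doc_vec, average(cat_vecs))
        if sim > best_sim:
            best_cat, best_sim = cat, sim
    return best_cat
```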


Step 4: Improving Your Classifiers with Prompting of PLMs  


Because existing methods (e.g., WeSTClass, LOTClass) use keyword matching or static token embeddings to generate pseudo labels for classifier training, PLM prompting can potentially improve pseudo-label quality through the contextualization power of PLMs. In this step, you will need to propose and implement your own idea for leveraging PLM prompting to improve the classifiers you got in the last step. Feel free to explore different types of models, such as MLM-based PLMs (BERT, RoBERTa), discriminative PLMs (ELECTRA), fine-tuned models like RoBERTa-MNLI, or large models like ChatGPT (sorry, we cannot provide access to the OpenAI API).

You can either propose a completely new method or improve on the one you used in Step 3; please make sure it is weakly supervised, i.e., it does not use any labels. You may refer to some recent papers for ideas:

Zhao et al., Pre-trained Language Models Can be Fully Zero-Shot Learners, in ACL 2023.

Park and Lee, LIME: Weakly-Supervised Text Classification Without Seeds, in COLING 2022.

Zhang et al., PromptClass: Weakly-Supervised Text Classification with Prompting Enhanced Noise-Robust Self-Training, arXiv:2305.13723.

Sun et al., Text Classification via Large Language Models, arXiv:2305.08377.
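One possible shape of a prompting approach, in the spirit of the papers above, is cloze-style scoring: append a prompt such as "This article is about [MASK]." to each document and let an MLM score each category's label word in the masked slot. In the sketch below the template and label words are illustrative choices, and `mask_word_prob` is a placeholder you would back with a real PLM (e.g., BERT via HuggingFace transformers returning the probability of each word at the [MASK] position).

```python
# Sketch of cloze-style prompting for weak supervision. The template and
# label words are illustrative, and `mask_word_prob` is a placeholder for a
# real MLM scorer returning P(word | [MASK] position) for the filled prompt.

TEMPLATE = "{doc} This article is about [MASK]."

LABEL_WORDS = {
    "politics": "politics",
    "sports": "sports",
    "business": "business",
    "technology": "technology",
}

def classify_by_prompt(doc, mask_word_prob, label_words=LABEL_WORDS):
    """Pick the category whose label word the (pluggable) MLM finds most likely."""
    prompt = TEMPLATE.format(doc=doc)
    scores = {cat: mask_word_prob(prompt, word) for cat, word in label_words.items()}
    return max(scores, key=scores.get)
```

The point of keeping the scorer pluggable is that the same pseudo-labeling loop works with any of the model families listed above.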


Do not simply use their code available on GitHub. In this step, we expect you to propose your own idea and implement it yourself. You also need to test your implementation on the two provided datasets and write a report.


Things to include in your report:

Describe your proposed idea in enough detail that others can reproduce your results.

In one table, compare the prediction accuracy of your new classifier and the one in Step 3 on the validation samples for both datasets, News and Movies.

Provide an analysis of your experimental results (adding additional experiments or case studies if necessary) to explain why your idea can or cannot improve the performance.


Step 5: Submit your result


In this step, you will apply the methods you implemented in Steps 2-4 to the two real-world datasets.

In your submission, you should include these files in a .zip:

yournetid_assignment.zip/
    |-- report.pdf
    |-- code/
    |    |-- category_guided_classification/ (everything you implemented in Step 4 and a README)
    |-- data/
         |-- movies/
         |    |-- train_phrase.txt: the first 100 documents of movies_train.txt after phrasal segmentation (please do not submit all documents)
         |    |-- good_terms.txt, bad_terms.txt: top-10 term embeddings for each category plus the category name embedding (11 lines per file, with the category name in the first line)
         |    |-- step_3_test_prediction.txt: your classification results from Step 3 (same number of lines as the testing file, with one predicted label ID per line)
         |    |-- step_4_test_prediction.txt: your classification results from Step 4 (same format)
         |-- news/ ... same as movies

Your submission will be evaluated on the following aspects:

1. The segmented corpus has a good number of quality phrases. (20 pts)

2. Meaningful representative phrases under each category. (20 pts)

3. Good document classification results based on your embeddings and classification algorithm. (20 pts)

4. A clear report on your proposed method with performance comparison and experimental analysis. (30 pts)

5. Comprehensive code and instructions to reproduce your results. (10 pts)


Double check before you submit


Now you can submit your assignment through Canvas !!

Congratulations, you just finished the programming assignment of CS512!!


Frequently Asked Questions


Having problems running AutoPhrase? Here are several solutions you may try:



(1) Use the Campus Cluster: please check the instructions provided on Piazza for how to use it. Everyone in the course has access to the campus cluster. You need to request compute nodes using the srun/sbatch command to run any script, or it will be killed automatically by the admin. Please see the documentation (https://campuscluster.illinois.edu/resources/docs/user-guide/) on how to use modules and request compute nodes.

(2) Use Google Colab in terminal mode (e.g., lines starting with ! are executed in the terminal; you may already have terminal access with a Pro subscription). See Piazza for more instructions on how to run WeSTClass for Step 3 on Colab.


(3) If you want to use a Mac with an ARM chip, you may need to install gcc 11+ instead and solve some dependency issues. Here is a related post: https://stackoverflow.com/questions/72758130/is-there-a-way-to-install-and-use-gcc-on-macbook-m1


(4) If you want to use Windows, use WSL Ubuntu (https://learn.microsoft.com/en-us/windows/wsl/install)


Grading for step 3 & 4:

Step 3 will be graded purely on test performance. We will use our runs of WeSTClass on both corpora as the standard for this step. You don't need to stick with WeSTClass and can use any other classifier you want, but note that your chosen classifier will be used as the baseline in Step 4.

Step 4 will be graded on both classification accuracy and your report. You should propose a way to improve classification accuracy with PLM prompting; either improving on Step 3 or proposing a completely new method is fine. Besides a clear description of your proposed method and its reported dev-set performance in your report, we expect you to achieve better test performance than your Step 3 classifier. If you cannot outperform the Step 3 classifier, we will grade based on the error analysis and model insights in your report.


PLMs in step 4:


Because we are focusing on the weakly-supervised setting, please do not use PLMs trained with task-related data. For example, directly using a BERT fine-tuned on sentiment analysis data is not allowed. You should only use LMs that are generically pre-trained or fine-tuned only on other tasks' widely available data (e.g., RoBERTa-MNLI is fine).



