I am an ML research engineer at Ford Motor Company, where I work on computer vision and machine learning for perception features in the context of automated driving. Most of my work is on camera images and LiDAR point clouds.
In my free time I enjoy playing/watching soccer, kickboxing, hiking (waterfall hikes are the best!) and practically any outdoor sport.
Email | LinkedIn | Github | Google Scholar | Twitter
The aim is to bridge the domain gap in de-identification of health records by using unlabeled data from the target domain in a self-training framework.
They use labeled data from the source domain + unlabeled data from the target domain.
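A minimal sketch of such a self-training loop (the `model_cls` interface, function names, and number of rounds are illustrative assumptions, not the paper's implementation):

```python
def self_train(model_cls, labeled_src, unlabeled_tgt, rounds=3):
    """Self-training sketch: a teacher trained on labeled source data
    pseudo-labels the unlabeled target data; a fresh student is then
    retrained on the union of both sets and becomes the next teacher."""
    teacher = model_cls()
    teacher.fit(labeled_src)  # supervised training on the source domain
    for _ in range(rounds):
        # Teacher runs without noise so pseudo-labels are as clean as possible.
        pseudo_tgt = [(x, teacher.predict(x)) for x in unlabeled_tgt]
        student = model_cls()
        # Student retrains on source labels + target pseudo-labels.
        student.fit(labeled_src + pseudo_tgt)
        teacher = student
    return teacher
```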
F1 score is used as the evaluation metric.
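For reference, entity-level F1 is the harmonic mean of precision (over predicted entities) and recall (over gold entities). A sketch assuming exact-span matching, where each entity is a `(start, end, type)` tuple:

```python
def entity_f1(gold, pred):
    """Exact-match entity F1: precision over predicted spans,
    recall over gold spans, harmonic mean of the two."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # entities predicted with exact span and type
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```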
Hartman et al. used unlabeled data to fine-tune the word-embedding layer, while this self-training framework uses unlabeled data to retrain the entire model.
Prior to release, all four publicly available datasets were manually de-identified: personal identifiers were replaced with plausible surrogate information, e.g., a patient's name replaced with a common name. The systems are therefore evaluated on this surrogate Personal Health Information (PHI).
They stress the importance of adding noise during training but remove the noise when generating pseudo-labels. (Presumably the noise regularizes the student during training, while a noise-free teacher produces more accurate pseudo-labels.)
They use data augmentation (word-level and character-level switch-out) and activation dropout as noise sources.
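A sketch of word-level switch-out, where each token is independently replaced by a random vocabulary token; the replacement rate and vocabulary here are illustrative, not the paper's settings:

```python
import random

def switch_out(tokens, vocab, rate=0.1, rng=random):
    """Word-level switch-out noise: each token is independently
    replaced by a token drawn uniformly from `vocab` with
    probability `rate`; otherwise it is kept unchanged."""
    return [rng.choice(vocab) if rng.random() < rate else t for t in tokens]
```

The character-level variant works the same way, sampling replacement characters inside each word instead of replacement words.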
Self-training (after 20 epochs of retraining) helps correct a whole entity when it already contains at least one correctly predicted word. Specifically, among entities with at least one correct word, 37.8% are corrected by self-training, versus only 11.8% of entities with no correct words. Intuitively, if an entity has a correct word, self-training can make the rest of the entity more compatible with the target domain, eventually correcting the whole entity.
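This kind of analysis can be reproduced by bucketing entities on whether the baseline got at least one word right before self-training; a hypothetical sketch (the input encoding is assumed, not from the paper):

```python
def correction_rates(entities):
    """Bucket entities by whether the baseline prediction got at least
    one word right, and report the fraction corrected by self-training.
    `entities` is a list of (n_words_correct_before, corrected_after)
    pairs. Returns (rate_with_correct_word, rate_without)."""
    partly, wrong = [], []
    for n_correct, fixed in entities:
        (partly if n_correct >= 1 else wrong).append(fixed)
    rate = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return rate(partly), rate(wrong)
```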
For cases in which the predictions are very wrong, the framework does not significantly improve performance.
It was observed that a few additional labeled samples significantly boosted domain adaptation performance.