semantic weltbild 2.0: OrganiK project: working on testdata collection

Tuesday, 21. April 2009

OrganiK project: working on testdata collection

As blogged in January, Gunnar, Remzi and I are working for DFKI on the OrganiK project. As true hard bloggin' scientists, we keep on reporting.

In the next two weeks, I will gather an exhaustive test data collection of texts that we can use for ontology learning. I hope to gather around 10,000 documents from various sources that overlap in topic. We need e-mails, office documents (contracts, etc.) and news documents. There are a lot of test data sets out there; the question now is which ones to pick. Also, in OrganiK we have SME partners who could provide some data.

After this, the next step will be to create a taxonomy learning module that analyses the documents and semi-automatically (or fully automatically) creates a taxonomy out of them for future classification. If it's fully automatic, I expect that the taxonomy will have probabilistic elements in it ("it thinks that this is a customer, but only with 60% confidence"). If we work with a probabilistic model throughout the whole project, we can rank everything all the time; maybe this will reduce human work. We will see.
Does anyone have experience with taxonomies that have weights attached? It's similar to a TF-IDF rank.
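
To make the idea of a weighted taxonomy a bit more concrete, here is a minimal Python sketch. It is not the OrganiK module, only an illustration under assumptions: the function names `tf_idf` and `rank_categories`, the toy documents and the hand-written two-category taxonomy are all made up for this example, and the normalised scores are relative shares rather than calibrated probabilities.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (plain TF-IDF, whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


def rank_categories(doc_weights, taxonomy):
    """Rank taxonomy categories for one document.

    Each category is scored by the summed TF-IDF weight of its terms;
    scores are then normalised so that they sum to 1.
    """
    scores = {
        category: sum(doc_weights.get(term, 0.0) for term in terms)
        for category, terms in taxonomy.items()
    }
    total = sum(scores.values())
    if total == 0:
        return []  # no category term occurs in the document
    ranked = [(category, score / total) for category, score in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data, only for illustration.
docs = [
    "customer signed the contract",
    "supplier sent the invoice",
    "customer paid the invoice",
]
taxonomy = {
    "customer": ["customer", "paid"],
    "supplier": ["supplier", "sent"],
}

weights = tf_idf(docs)
for doc, doc_weights in zip(docs, weights):
    print(doc, "->", rank_categories(doc_weights, taxonomy))
```

Every category assignment comes with a number attached, so everything can be ranked, and a human would only need to look at the assignments where the top score is not clearly ahead of the rest.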

leobard - 21. Apr, 13:37

2 comments

swatbolish - 20. Feb, 14:17

That would be great, but collection is not an easy job.

Martink23 - 21. Feb, 06:07

Wow, I commend you on setting such a high goal and am looking forward to seeing it all when you get it all together. I couldn't imagine trying to gather 10,000 documents from various sources!