Tuesday, 21. April 2009

OrganiK project: working on test data collection

As blogged in January, Gunnar, Remzi and I are working for DFKI on the OrganiK project. As true hard bloggin' scientists, we keep on reporting.

In the next two weeks, I will gather an exhaustive test data collection of texts that we can use for ontology learning. I hope to gather around 10,000 documents from various sources that overlap in topic. We need e-mails, office documents (contracts, etc.) and news documents. There are a lot of test data sets out there; the question now is which one to pick. Also, in OrganiK we have SME partners who could provide some data.

After this, the next step will be to create a taxonomy learning module that analyses the documents and semi-automatically (or fully automatically) creates a taxonomy out of them for future classification. If it's fully automatic, I expect that the taxonomy will have probabilistic elements in it ("it thinks that this is a customer, but only 60%"). If we work with a probabilistic model throughout the whole project, we can rank everything all the time, and maybe this will reduce human work. We will see.
Does anyone have experience with taxonomies that have weights attached? It's similar to a TF/IDF rank.
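To make the TF/IDF comparison a bit more concrete, here is a minimal sketch of how such a weight could be computed for candidate taxonomy terms. This is not the OrganiK module, only an illustration: the function names `tf_idf` and `rank_terms` and the tiny tokenised corpus are made up for this example, and it uses the plain textbook formula (term frequency times the log of inverse document frequency).

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # relative term frequency * log(inverse document frequency)
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights


def rank_terms(doc_weights):
    """Sort the terms of one document by descending weight."""
    return sorted(doc_weights.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy corpus, already split into tokens.
docs = [
    ["customer", "contract", "signed"],
    ["customer", "meeting", "notes"],
    ["customer", "invoice", "contract"],
]

weights = tf_idf(docs)
for term, weight in rank_terms(weights[0]):
    print(f"{term}: {weight:.3f}")
```

A term that occurs in every document (here "customer") gets weight zero, while rarer terms rank higher. A weighted taxonomy could store a number like this, or a probability, next to each concept so that everything can be ranked.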
swatbolish - 20. Feb, 14:17

That would be great, but collection is not an easy job.

Martink23 - 21. Feb, 06:07

Wow, I commend you on setting such a high goal and am looking forward to seeing it all when you get it all together. I couldn't imagine trying to gather 10,000 documents from various sources!
