Data Science, AI, Machine Learning
Resources for learning
Basics
Basic math
- Kernel
- In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods use linear classifiers to solve nonlinear problems.[1] The general task of pattern analysis is to find and study general types of relations (for example clusters, rankings, principal components, correlations, classifications) in datasets. For many algorithms that solve these tasks, the raw data have to be explicitly transformed into feature vector representations via a user-specified feature map. In contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over all pairs of data points, computed using inner products. The feature map implied by a kernel may be infinite dimensional, but by the representer theorem the user only needs to supply a finite-dimensional matrix of pairwise kernel values. Without parallel processing, kernel machines are slow to compute for datasets larger than a couple of thousand examples.
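A minimal sketch of the idea described above, in plain Python. It assumes a homogeneous degree-2 polynomial kernel on 2-D points (a choice made here for illustration, not taken from the text) and checks that the kernel value equals the inner product of explicitly mapped feature vectors. The names `poly2_kernel`, `poly2_feature_map` and `gram_matrix` are hypothetical.

```python
import math

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

def poly2_feature_map(x):
    """Explicit feature map for 2-D input whose inner product equals poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def gram_matrix(points, kernel):
    """Finite n x n matrix of kernel values over all pairs of points."""
    return [[kernel(p, q) for q in points] for p in points]

points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]

# Kernel route: only pairwise similarities are computed.
K = gram_matrix(points, poly2_kernel)

# Explicit route: map every point first, then take inner products.
mapped = [poly2_feature_map(p) for p in points]
K_explicit = gram_matrix(mapped, dot)

for row_k, row_e in zip(K, K_explicit):
    print([round(v, 6) for v in row_k], [round(v, 6) for v in row_e])
```

Both routes give the same matrix, but the kernel route never builds the feature vectors. For kernels whose feature space is infinite dimensional, only the kernel route is available.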
Datasets and analysis
Datasets
Analysis
- Timi.eu - Frank Vanden Berghen (ULB) - Belgian data science and mining (Anatella, R, Python, ...) - telecom, banking
- Endor - predictive analytics
- Caseware - IDEA
Apache
- Apache Spark - a unified analytics engine for large-scale data processing, including map/reduce and machine learning
- Apache Lucene - search and indexing
- Lucene Core provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities
- Solr is a high performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface
- PyLucene is a Python extension for accessing the Java-based Lucene Core from Python
- Apache Hadoop - framework for the distributed processing of large data sets across clusters of computers; related projects include Cassandra, Spark and many others
- Hadoop Common: common utilities that support the other Hadoop modules
- Hadoop Distributed File System (HDFS): provides high-throughput access to application data
- Hadoop YARN: A framework for job scheduling and cluster resource management
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets
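The map/reduce model behind Hadoop MapReduce can be sketched in a single Python process. This is not the Hadoop API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and the word-count task is just a common illustration of the three phases.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data big clusters", "data across clusters"])
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data between them. Here everything runs sequentially to keep the data flow visible.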
Vendors
Amazon
- Amazon - machine learning - SageMaker, DeepLens
Google
IBM
Microsoft
SAP
Learning and AI
- AI - Wikipedia
- Machine Learning - Wikipedia
- Unsupervised learning finds patterns in a stream of input.
- Supervised learning requires a human to label the input data first, and comes in two main varieties: classification and numerical regression.
- Classification is used to determine what category something belongs in – the program sees a number of examples of things from several categories and will learn to classify new inputs.
- Regression is the attempt to produce a function that describes the relationship between inputs and outputs and predicts how the outputs should change as the inputs change.
- Both classifiers and regression learners can be viewed as "function approximators" trying to learn an unknown (possibly implicit) function; for example, a spam classifier can be viewed as learning a function that maps from the text of an email to one of two categories, "spam" or "not spam".
- In reinforcement learning the agent is rewarded for good responses and punished for bad ones. The agent classifies its responses to form a strategy for operating in its problem space.
- Transfer learning is when the knowledge gained from one problem is applied to a new problem.
- Reinforcement Learning - Wikipedia
- OpenAI - ChatGPT - combines a nonprofit with a capped-profit limited partnership (OpenAI LP)
- log-in with Google
- generative pre-training (GPT-n)
- ChatGPT
- DALL-E and CLIP
- Keras - a high-level neural networks API, written in Python, runs on top of TensorFlow, CNTK, or Theano
- OpalProject - MIT, Orange, Telefonica, ...
- deeplearning.net
- deeplearning.net - Theano - Python library
- IT - ExpertSystem - Text Analytics and Cognitive Computing, founded in Modena, 1989
- IL - Thetaray - AML, Amir Averbuch with Yaron Blachman as VP Channels
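The supervised-learning notes above describe regression and classification as "function approximators". A minimal sketch of both in plain Python follows. It assumes a least-squares line fit for regression and a nearest-centroid rule for classification; both are chosen here for brevity and are not prescribed by the sources listed. The single "suspicious word fraction" feature and all data values are made up for illustration.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_centroids(samples, labels):
    """Classification: learn the mean feature value of each labelled class."""
    sums, counts = {}, {}
    for x, label in zip(samples, labels):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Regression on points that lie exactly on y = 2x + 1 (made-up data).
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)

# Classification on a hypothetical feature: fraction of suspicious words.
fractions = [0.8, 0.9, 0.7, 0.1, 0.2, 0.0]
labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]
centroids = fit_centroids(fractions, labels)
print(classify(centroids, 0.75), classify(centroids, 0.15))
```

In both cases the human-supplied labels (the `ys` values and the `labels` list) are what make the learning supervised: the learned function maps new inputs to a number or to a category.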
Safety and AI
- AI safety - Stanford - AI to do no harm
- FORMAL METHODS - Using precise mathematical modeling to ensure the safety, security, and robustness of conventional software and hardware systems.
- LEARNING AND CONTROL - Designing systems that intelligently balance learning under uncertainty and acting safely.
- TRANSPARENCY - Understanding safety in the context of fairness, accountability, and explainability for autonomous and intelligent systems.
Security of AI
Applications
Image processing
Translation
Natural Language Processing (NLP)
- Flair, NLP library - used e.g. by Zalando
- Covers: named entity recognition (NER), sentiment analysis, part-of-speech tagging (PoS), special support for biomedical data, sense disambiguation and classification, ...
- Sheffield NLP group
- Sheffield GATE - General Architecture for Text Engineering - Open Source - tutorials etc
- Stanford NLP - Open Source
- NLTK - Open Source
- OpenNLP Apache - Open Source
- GATE - a full-lifecycle open source solution for text processing
- UIMA Apache - Unstructured Information Management - Open Source
- WordNet - Princeton's lexical database, 117 000 synsets
- FrameNet - Berkeley lexical database of English, human- and machine-readable, based on annotating examples of how words are used in actual texts
Finance
Software development
Explainable AI
- SHAP - 'SHapley Additive exPlanations' is a unified approach to explain the output of any machine learning model
- DALEX - Descriptive mAchine Learning EXplanations
- LIME - Local Interpretable Model-agnostic Explanations, Marco Tulio Ribeiro, Department of Computer Science and Engineering, University of Washington
- Bulletproof.AI - Martin Rehak
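SHAP, listed above, builds on Shapley values from cooperative game theory. The sketch below computes exact Shapley values by brute force for a tiny made-up model. It does not use the `shap` library or its API. It assumes that a "missing" feature is represented by a baseline value, which is one common convention among several, and the names `shapley_values` and `toy_model` are hypothetical.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)          # start with every feature "missing"
        previous = model(current)
        for i in order:
            current[i] = instance[i]      # switch feature i to its real value
            value = model(current)
            totals[i] += value - previous # marginal contribution of feature i
            previous = value
    return [total / len(orderings) for total in totals]

def toy_model(x):
    """Made-up model: one additive term plus one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]

instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The attributions sum to the difference between the model output at the instance and at the baseline, and the interaction term is split evenly between the two features involved. Enumerating all orderings costs n! model calls, so this only works for a handful of features.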