Data Science, AI, Machine Learning

Resources for learning
Basic math
Datasets and analysis
Vendors

Amazon
Google
IBM
Microsoft
SAP

Learning and AI
Safety and AI
Security of AI
Applications

Image processing
Translation
Natural Language Processing
Finance
Software development

Explainable AI

Resources for learning

Basics

Local files - math
Local files - big data

Kaggle intro

OECD.AI including tools and catalogue (for policy makers)

Basic math

Kernel

In machine learning, kernel machines are a class of algorithms for pattern analysis whose best-known member is the support-vector machine (SVM). These methods use linear classifiers to solve nonlinear problems.[1] The general task of pattern analysis is to find and study general types of relations (for example clusters, rankings, principal components, correlations, classifications) in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map. In contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over all pairs of data points, which corresponds to an inner product in the feature space. The feature map of a kernel machine may be infinite dimensional, but by the representer theorem the user only has to supply a finite-dimensional matrix of kernel values, one entry per pair of training points. Because that matrix grows quadratically with the number of examples, kernel machines are slow to compute for datasets larger than a couple of thousand examples without parallel processing.
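The relation between a kernel and a feature map can be illustrated with a small sketch. It assumes a homogeneous degree-2 polynomial kernel on 2-D points, chosen here only because its feature map is finite and easy to write down; the function names and the sample points are hypothetical and not taken from any library.

```python
import math

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot ** 2

def poly2_feature_map(x):
    """Explicit feature map for 2-D inputs whose inner product equals poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def gram_matrix(points, kernel):
    """n x n matrix of kernel values over all pairs of data points."""
    return [[kernel(p, q) for q in points] for p in points]

# Hypothetical sample data
points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]

# Kernel route: only pairwise similarities are computed
K = gram_matrix(points, poly2_kernel)

# Explicit route: transform every point first, then take inner products
features = [poly2_feature_map(p) for p in points]
K_explicit = [[sum(a * b for a, b in zip(f, g)) for g in features]
              for f in features]
```

Both routes give the same matrix up to rounding. A kernel method works from K alone, which is why it still applies when the feature map is too large, or infinite, to compute explicitly.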

Datasets and analysis

Datasets

re3data - Registry of Research Data Repositories
Google Dataset Search - strong use of metadata for searching

Analysis

Timi.eu - Frank Vanden Berghen (ULB) - Belgian data science and mining (Anatella, R, Python, ...) - telecom, banking
Endor - predictive analytics
Caseware - IDEA

Apache

Apache Spark - a unified analytics engine for large-scale data processing, including map/reduce and machine learning
Apache Lucene - search and indexing

Lucene Core provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities
Solr is a high performance search server built using Lucene Core, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface
PyLucene is a Python extension for accessing the Java Lucene Core from Python (a wrapper around the Java library rather than a rewrite)

Apache Hadoop - framework for the distributed processing of large data sets across clusters of computers; part of a wider ecosystem of related projects such as Cassandra, Spark and many others

Hadoop Common: common utilities that support the other Hadoop modules
Hadoop Distributed File System (HDFS): provides high-throughput access to application data
Hadoop YARN: A framework for job scheduling and cluster resource management
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets
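The map/reduce model mentioned for Spark and Hadoop can be sketched in a few lines of single-process Python. This is only an illustration of the programming model (map, shuffle by key, reduce), not the Hadoop or Spark API; all function names and the sample documents are hypothetical.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine all values of one key into a single result."""
    return (key, sum(values))

def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())

# Hypothetical input; in a real cluster each document could live on a different node
documents = ["big data big clusters", "data processing on clusters"]
counts = word_count(documents)
```

In a distributed setting the map and reduce calls run in parallel on different machines and the shuffle moves data between them; here everything runs in one process to keep the structure visible.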

Vendors

Amazon

Amazon - machine learning - SageMaker, DeepLens

Google

Google Research
Google - Colaboratory - Jupyter notebook environment/TensorFlow

Google AI
Google AI - research
Google AI - Brain

TensorFlow - ML toolkit

IBM

IBM Watson

Microsoft

Microsoft Virtual Academy - AI, ML, development, ...
Eric Horvitz - models of bounded rationality founded in probability and decision theory, Stanford

SAP

SAP developers
SAP - Machine Learning

Learning and AI

AI - Wikipedia
Machine Learning - Wikipedia

Unsupervised learning finds patterns in a stream of input.
Supervised learning requires a human to label the input data first, and comes in two main varieties: classification and numerical regression.
- Classification is used to determine what category something belongs in – the program sees a number of examples of things from several categories and will learn to classify new inputs.
- Regression is the attempt to produce a function that describes the relationship between inputs and outputs and predicts how the outputs should change as the inputs change.
Both classifiers and regression learners can be viewed as "function approximators" trying to learn an unknown (possibly implicit) function; for example, a spam classifier can be viewed as learning a function that maps from the text of an email to one of two categories, "spam" or "not spam".
In reinforcement learning the agent is rewarded for good responses and punished for bad ones. The agent classifies its responses to form a strategy for operating in its problem space.
Transfer learning is when the knowledge gained from one problem is applied to a new problem.
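The "function approximator" view of supervised learning can be made concrete with two tiny learners: a least-squares line fit for regression and a nearest-centroid rule for classification. Both are minimal sketches chosen for brevity, not the methods of any particular library, and the spam features (number of links, number of exclamation marks) and all data values are made-up assumptions.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def fit_centroids(points, labels):
    """Classification, training step: average the labelled points of each category."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, value in enumerate(point):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}

def classify(centroids, point):
    """Classification, prediction step: pick the category with the closest centroid."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, point))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Regression on hypothetical data that follows y = 2x + 1
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Classification on hypothetical emails described as (links, exclamation marks)
emails = [(8.0, 5.0), (7.0, 6.0), (1.0, 0.0), (0.0, 1.0)]
labels = ["spam", "spam", "not spam", "not spam"]
centroids = fit_centroids(emails, labels)
prediction_a = classify(centroids, (6.0, 4.0))
prediction_b = classify(centroids, (1.0, 1.0))
```

In both cases a human supplies the labelled examples and the learner returns a function: fit_line maps a number to a number, while classify maps a feature vector to one of two categories.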

Reinforcement Learning - Wikipedia
OpenAI - both a nonprofit and a capped-profit limited partnership (OpenAI LP)

Generative Pre-trained Transformer (GPT-n)
ChatGPT
DALL-E and CLIP

Keras - a high-level neural networks API, written in Python, runs on top of TensorFlow, CNTK, or Theano

OpalProject - MIT, Orange, Telefonica, ...

deeplearning.net
deeplearning.net - Theano - Python library

IT - ExpertSystem - Text Analytics and Cognitive Computing, founded in Modena, 1989

IL - Thetaray - AML, Amir Averbuch with Yaron Blachman as VP Channels

Safety and AI

AI safety - Stanford - AI to do no harm

FORMAL METHODS - Using precise mathematical modeling to ensure the safety, security, and robustness of conventional software and hardware systems.
LEARNING AND CONTROL - Designing systems that intelligently balance learning under uncertainty and acting safely.
TRANSPARENCY - Understanding safety in the context of fairness, accountability, and explainability for autonomous and intelligent systems.

Security of AI

Security of ML and AI refs

Applications

Image processing

ImageNet competition - rooted in Princeton DB
ImageNet website

Translation

DeepL - Wikipedia
DeepL website

Natural Language Processing (NLP)

Flair, NLP library - used e.g. by Zalando

Covers: named entity recognition (NER), sentiment analysis, part-of-speech tagging (PoS), special support for biomedical data, sense disambiguation and classification, ...

Sheffield NLP group
Sheffield GATE - General Architecture for Text Engineering - Open Source - tutorials etc
Stanford NLP - Open Source
NLTK - Open Source
OpenNLP Apache - Open Source
GATE - a full-lifecycle open source solution for text processing
UIMA Apache - Unstructured Information Management - Open Source
WordNet - Princeton's lexical database, 117 000 synsets
FrameNet - Berkeley lexical database of English, human- and machine-readable, based on annotating examples of how words are used in actual texts

Finance

Credit Risk - Early Warning System- ING, Google, PwC

Software development

KuafuAI DevOpsGPT - from requirements to code
KuafuAI.net -

Explainable AI

SHAP - 'SHapley Additive exPlanations' is a unified approach to explain the output of any machine learning model
DALEX - Descriptive mAchine Learning EXplanations
LIME - Local Interpretable Model-agnostic Explanations, Marco Tulio Ribeiro, Department of Computer Science and Engineering, University of Washington
Bulletproof.AI - Martin Rehak
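The Shapley values behind SHAP can be computed exactly for a very small model by averaging each feature's marginal contribution over all feature orderings. The sketch below does this by brute force; it is not the SHAP library's API, and the toy model, the instance, and the all-zero baseline are assumptions made purely for illustration.

```python
from itertools import permutations

def shapley_values(value_fn, players):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value_fn(with_p) - value_fn(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical model with one additive term and one interaction term
def model(a, b, c):
    return 2.0 * a + b * c

instance = {"a": 1.0, "b": 2.0, "c": 3.0}   # the prediction to explain
baseline = {"a": 0.0, "b": 0.0, "c": 0.0}   # assumed reference input

def value_fn(coalition):
    """Model output when only the features in the coalition take their instance values."""
    inputs = {name: (instance[name] if name in coalition else baseline[name])
              for name in instance}
    return model(**inputs)

phi = shapley_values(value_fn, list(instance))
```

Feature a receives its full additive effect of 2, while the interaction b * c = 6 is split evenly between b and c, so the attributions sum to the difference between the prediction and the baseline output. Enumerating all orderings grows factorially with the number of features, which is why practical tools rely on approximations or model-specific shortcuts.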