
BigCode @BigCodeProject
SantaCoder is trained on Python, Java, and JavaScript and considerably outperforms larger multilingual models such as InCoder (6.7B) and CodeGen-multi (2.7B)! A lot of pieces from a lot of collaborators came together to get to this result:
BigCode @BigCodeProject
The foundation for training SantaCoder is The Stack (v1.1) dataset. Given the relatively small size of our model (1B parameters), we chose three popular programming languages: Python, Java, and JavaScript. You can check if your code was used for training here: huggingface.co/spaces/bigcode…
BigCode @BigCodeProject
Before training any models, we looked into removing sensitive information from code, such as email addresses, secret keys, and IP addresses. For that purpose, we annotated 400 samples, then built and continuously refined regex rules to remove this information before training.
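A minimal sketch of what such regex-based redaction can look like. The patterns and placeholder tokens below are illustrative assumptions, not the refined rules BigCode actually derived from their 400 annotated samples:

```python
import re

# Illustrative patterns (assumptions, not BigCode's actual rules):
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}"
    r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\b"
)
# Long hex strings as a crude stand-in for secret keys/tokens.
SECRET_RE = re.compile(r"\b[0-9a-fA-F]{32,}\b")


def redact(code: str) -> str:
    """Replace sensitive matches with typed placeholder tokens."""
    code = EMAIL_RE.sub("<EMAIL>", code)
    code = IPV4_RE.sub("<IP_ADDRESS>", code)
    code = SECRET_RE.sub("<KEY>", code)
    return code


print(redact("# contact: dev@example.com, host = 192.168.0.1"))
```

In practice the patterns are iterated against the annotated samples: precision matters, since over-eager rules (e.g. the hex pattern above matching a git commit hash) would corrupt legitimate training code.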